To address the challenge of retaining talented employees, the
leadership team of Frito Lay has engaged DDS Analytics to conduct an
in-depth analysis aimed at uncovering the underlying factors behind
employee turnover within their organization. This study, leveraging data
from 870 employees across 36 distinct variables, is designed to yield
actionable insights to inform decision-making processes. By employing a
combination of visualizations, statistical methodologies, and the
creation of predictive models, the objective is to provide strategic
guidance and practical tools to enhance future strategies and
organizational policies.
Presentation
View the video presentation of this analysis: https://youtu.be/aY4CYfuHOf4.
This code chunk imports the necessary libraries.
library(RCurl)
library(tidyverse)
library(ggplot2)
library(DataExplorer)
library(Hmisc) #rcorr(), generates a correlation matrix
library(corrplot) #to plot the correlation matrix
library(gridExtra) #to arrange plots in a grid
library(RColorBrewer)
library(caret) #ConfusionMatrix()
library(e1071) #naiveBayes()
library(class) #knn
library(combinat)
library(patchwork)
library(GGally)
library(kableExtra) #kable() tables
library(pander) #pander() tables
library(dplyr) #loaded after Hmisc due to a conflict with summarize(); using summarise() also avoids it
I import the primary data set from the cloud as
cs2. Then, I convert the numerical variables that are
Likert scale responses to factors, creating a new data frame,
cs2_conv.
# import data: CaseStudy2-data.csv from AWS S3 msdsds6306 bucket
cs2 = read.table(textConnection(getURL(
"https://s3.us-east-2.amazonaws.com/msdsds6306/CaseStudy2-data.csv"
)), sep=",", header=TRUE, stringsAsFactors = TRUE)
# save a copy of the data
# write.csv(cs2, file = "data/CaseStudy2-data.csv", row.names = FALSE)
# get a sense of the data
str(cs2)
'data.frame': 870 obs. of 36 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Age : int 32 40 35 32 24 27 41 37 34 34 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
$ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
$ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
$ Education : int 4 3 2 4 1 2 5 4 4 4 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
$ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
$ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
$ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
$ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
$ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
$ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
$ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
$ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
$ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
$ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
$ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
$ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
$ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
$ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
$ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
$ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
$ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
$ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
$ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
df_summary = summary(cs2)
df_summary
ID Age Attrition BusinessTravel
Min. : 1.0 Min. :18.00 No :730 Non-Travel : 94
1st Qu.:218.2 1st Qu.:30.00 Yes:140 Travel_Frequently:158
Median :435.5 Median :35.00 Travel_Rarely :618
Mean :435.5 Mean :36.83
3rd Qu.:652.8 3rd Qu.:43.00
Max. :870.0 Max. :60.00
DailyRate Department DistanceFromHome Education
Min. : 103.0 Human Resources : 35 Min. : 1.000 Min. :1.000
1st Qu.: 472.5 Research & Development:562 1st Qu.: 2.000 1st Qu.:2.000
Median : 817.5 Sales :273 Median : 7.000 Median :3.000
Mean : 815.2 Mean : 9.339 Mean :2.901
3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
Max. :1499.0 Max. :29.000 Max. :5.000
EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
Human Resources : 15 Min. :1 Min. : 1.0 Min. :1.000
Life Sciences :358 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
Marketing :100 Median :1 Median :1039.0 Median :3.000
Medical :270 Mean :1 Mean :1029.8 Mean :2.701
Other : 52 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
Technical Degree: 75 Max. :1 Max. :2064.0 Max. :4.000
Gender HourlyRate JobInvolvement JobLevel
Female:354 Min. : 30.00 Min. :1.000 Min. :1.000
Male :516 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
Median : 66.00 Median :3.000 Median :2.000
Mean : 65.61 Mean :2.723 Mean :2.039
3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
Max. :100.00 Max. :4.000 Max. :5.000
JobRole JobSatisfaction MaritalStatus MonthlyIncome
Sales Executive :200 Min. :1.000 Divorced:191 Min. : 1081
Research Scientist :172 1st Qu.:2.000 Married :410 1st Qu.: 2840
Laboratory Technician :153 Median :3.000 Single :269 Median : 4946
Manufacturing Director : 87 Mean :2.709 Mean : 6390
Healthcare Representative: 76 3rd Qu.:4.000 3rd Qu.: 8182
Sales Representative : 53 Max. :4.000 Max. :19999
(Other) :129
MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
Min. : 2094 Min. :0.000 Y:870 No :618 Min. :11.0
1st Qu.: 8092 1st Qu.:1.000 Yes:252 1st Qu.:12.0
Median :14074 Median :2.000 Median :14.0
Mean :14326 Mean :2.728 Mean :15.2
3rd Qu.:20456 3rd Qu.:4.000 3rd Qu.:18.0
Max. :26997 Max. :9.000 Max. :25.0
PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
Median :3.000 Median :3.000 Median :80 Median :1.0000
Mean :3.152 Mean :2.707 Mean :80 Mean :0.7839
3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
Median :10.00 Median :3.000 Median :3.000 Median : 5.000
Mean :11.05 Mean :2.832 Mean :2.782 Mean : 6.962
3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000
Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Min. : 0.000 Min. : 0.000 Min. : 0.00
1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.00
Median : 3.000 Median : 1.000 Median : 3.00
Mean : 4.205 Mean : 2.169 Mean : 4.14
3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.00
Max. :18.000 Max. :15.000 Max. :17.00
# convert numerical ranking variables to factors
numVars_to_fact = c("Education", "EnvironmentSatisfaction",
"JobInvolvement", "JobLevel", "JobSatisfaction",
"PerformanceRating", "RelationshipSatisfaction",
"StockOptionLevel", "WorkLifeBalance")
cs2_conv = cs2 %>%
mutate(across(all_of(numVars_to_fact), as.factor))
# check the conversion
# str(cs2_conv)
# get the proportion of attrition
summary(cs2_conv$Attrition)
No Yes
730 140
NoAttProb = summary(cs2_conv$Attrition)["No"] / sum(summary(cs2_conv$Attrition))
NoAttProb
No
0.8390805
AttProb = summary(cs2_conv$Attrition)["Yes"] / sum(summary(cs2_conv$Attrition))
AttProb
Yes
0.1609195
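The same baseline proportions can be computed more compactly with `prop.table()`. A minimal sketch, assuming `cs2_conv` is loaded as above:

```r
# proportion of each Attrition level in one step;
# prop.table() divides each count by the total
attrition_props = prop.table(table(cs2_conv$Attrition))
attrition_props
# the "No" proportion (~0.84) is also the accuracy of a naive always-"No"
# classifier, a useful baseline when evaluating the predictive models later
```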
For the exploratory data analysis (EDA), first I
visualize the data. I loop through and plot nearly all the numerical
variables in histograms and frequency polygons by density grouped by
attrition category. I also loop through each categorical variable, find
the proportion of each level within each attrition category and plot
them as bar charts.
# plot the distributions of numerical variables with respect to attrition group
# select all variables except ID, EmployeeCount (all=1), and StandardHours (all=80))
cs2_conv_subset = cs2_conv %>% select(-ID, -EmployeeCount, -StandardHours)
# check the modification
# str(cs2_conv_subset)
# filter the numerical variables
numVars = sapply(cs2_conv_subset, is.numeric)
cs2_numerical = cs2_conv_subset[, numVars] #this does not include integers that were converted to factors
# check the result
# str(cs2_numerical)
# function to calculate binwidth (by Freedman-Diaconis rule, Bin Width=2*IQR/(cuberoot(n)))
# and flexibly set a binwidth for each variable
calculate_binwidth = function(data) {
IQR = quantile(data, 0.75) - quantile(data, 0.25)
n = length(data)
bw = 2 * IQR / (n^(1/3))
return(bw)
}
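One edge case worth guarding against: if a variable's IQR is 0 (possible for heavily discrete features), the Freedman-Diaconis rule yields a binwidth of 0, which `geom_histogram()` rejects. A hedged variant with a fallback (the Sturges-style fallback is my assumption, not part of the original analysis):

```r
# Freedman-Diaconis binwidth with a fallback for zero-IQR data
calculate_binwidth_safe = function(data) {
  # IQR() is equivalent to quantile(data, 0.75) - quantile(data, 0.25)
  bw = 2 * IQR(data) / (length(data)^(1/3))
  if (bw == 0) {
    # fall back to a range-based width (Sturges-like); assumption, not in the original
    bw = diff(range(data)) / nclass.Sturges(data)
  }
  bw
}
```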
# iterate over numerical variables and create plots
dist_plots = list()
for (col in names(cs2_numerical)) {
# calculate binwidth for current variable
bw = calculate_binwidth(cs2_conv[[col]])
# create a histogram with density not frequency for each variable
hist_plot = cs2_conv %>%
ggplot(aes(x = .data[[col]], y = after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = bw) +
facet_wrap(~Attrition) +
labs(title = paste("Histogram of", col))
# save the histogram plot
# ggsave(filename = paste0("plots/EDA/", col, "_histogram.png"), plot = hist_plot, device = "png")
# create a frequency polygon plot for each variable
freqpoly_plot = cs2_conv %>%
ggplot(aes(x = .data[[col]], y = after_stat(density), color = Attrition)) +
geom_freqpoly(binwidth = bw) +
labs(title = paste("Frequency Polygon of", col))
# save the frequency polygon plot
# ggsave(filename = paste0("plots/EDA/", col, "_freqpoly.png"), plot = freqpoly_plot, device = "png")
dist_plots[[col]] = list(hist_plot, freqpoly_plot)
}
# display plots
for (col in names(cs2_numerical)) {
print(dist_plots[[col]][[1]]) #displays histogram
print(dist_plots[[col]][[2]]) #displays frequency polygon
}
# make bar charts to visualize categorical variables with respect to attrition group
# filter categorical variables (factors)
cs2_categorical = cs2_conv %>%
select(where(is.factor), -Over18) #exclude Over18 (all = "Y")
# check the result
# str(cs2_categorical)
# iterate over categorical variables and create bar plots
bar_plots = list()
for (col in names(cs2_categorical)) {
if (col != "Attrition") { #check if the column is not "Attrition"
# calculate proportions of each category within each level of Attrition
prop_cat_data = cs2_categorical %>%
group_by(Attrition, .data[[col]]) %>% #Attrition is already a column of cs2_categorical
dplyr::summarize(count = n(), .groups = "drop") %>% #this gets rid of the warning message about grouping
group_by(Attrition) %>%
mutate(proportion = count / sum(count))
bar_plot = prop_cat_data %>%
ggplot(aes(x = .data[[col]], y = proportion, fill = Attrition)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = paste("Bar Chart of", col)) +
coord_flip()
# save the bar plot
# ggsave(filename = paste0("plots/EDA/", col, "_barplot.png"), plot = bar_plot, device = "png")
bar_plots[[col]] = bar_plot
}
}
# display plots if they exist (there is no plot for attrition)
for (col in names(cs2_categorical)) {
if (!is.null(bar_plots[[col]]))
print(bar_plots[[col]])
}
Many of the numerical features have right-skewed
distributions, e.g., salary and the variables related to years of work
experience.
Using the plots above, I look for features that may be associated with
higher attrition.
Numerical features associated with higher attrition may include the
following.
Categorical features associated with higher attrition may include the following.
One odd observation: performance rating does not seem to have much effect
overall, yet attrition is higher among employees rated 4.
These factors could be correlated with age and higher skill: age,
education, environmental satisfaction(?), job involvement(?), job level,
job satisfaction(?), marital status, monthly income, overtime, stock
option level, total working years, years at company, years in current
role, years with current manager.
The next step of my EDA is to check for missing values
before proceeding with further analysis.
# are there missing values
missing_count = sum(is.na(cs2_conv) | cs2_conv == "")
writeLines(sprintf("There are %d missing values in the dataset.", missing_count))
There are 0 missing values in the dataset.
The data set appears to be complete.
This function generates a report to facilitate EDA. It
is good to know about and look over for ideas, though perhaps not as
useful as my step-by-step analysis.
create_report(cs2_conv, y = "Attrition")
For the numerical variables, I generate a matrix of
correlation coefficients and plot them. There is also a function to
generate a table of correlation and p-values, but only the first few
rows are displayed for the sake of space. I find the matrix to be
sufficient.
# compute correlation matrix and p-values
cor_mat = rcorr(as.matrix(cs2_numerical))
# This function makes a table of each variable combination with the correlation coefficient and p-value
# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix = function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor = cormat[ut],
p = pmat[ut]
)
}
# call the function with extracted correlation and p-values
corrTable = flattenCorrMatrix(cor_mat$r, cor_mat$P)
kable(head(corrTable), caption = "correlation coefficients") %>% kable_styling(full_width = FALSE, position = "left")
| row | column | cor | p |
|---|---|---|---|
| Age | DailyRate | 0.0127367 | 0.7075463 |
| Age | DistanceFromHome | 0.0066471 | 0.8447809 |
| DailyRate | DistanceFromHome | 0.0149445 | 0.6597985 |
| Age | EmployeeNumber | 0.0104706 | 0.7577761 |
| DailyRate | EmployeeNumber | -0.0297344 | 0.3810454 |
| DistanceFromHome | EmployeeNumber | -0.0046415 | 0.8912620 |
# plot the correlations using corrplot
corrplot(cor_mat$r, type = "upper", order = "hclust", insig = "blank")
# If I include the p.mat argument I get an error.
# I think because pairs of the same variable create an NA.
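The `p.mat` error is likely because `rcorr()` leaves NA on the diagonal of its p-value matrix (a variable's correlation with itself has no p-value). One possible workaround, untested against this exact data, is to fill the diagonal before plotting:

```r
# rcorr()'s $P matrix has NA on the diagonal, which corrplot rejects;
# set the diagonal to 0 (self-correlations are trivially "significant")
p_mat = cor_mat$P
diag(p_mat) = 0
corrplot(cor_mat$r, type = "upper", order = "hclust",
         p.mat = p_mat, sig.level = 0.05, insig = "blank")
```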
From this matrix, there is evidence of correlation between some
numerical variables; in particular, total working years correlates with
several of the other tenure-related variables.
Next in the EDA, I look at each numerical variable for
statistically significant evidence that the mean value of that variable
differs between the populations of employees that stay or go. I perform
t-tests for each numerical feature with the null hypothesis that the
means of the attrition groups are equal.
# extract numerical column names from cs2_numerical
numerical_columns = colnames(cs2_numerical)
# initialize an empty list to store t-test results
t_test_results = list()
# initialize an empty dataframe to store summary statistics
summary_table = data.frame(variable = character(),
mean_Attrition_Yes = numeric(),
mean_Attrition_No = numeric(),
p_value = numeric(),
stringsAsFactors = FALSE)
# initialize empty lists to store variable names with p-value < 0.05 or < 0.001
significant_variables = list() #p-value < 0.05
super_sig_variables = list() # p-value < 0.001
# loop through each numerical variable
for (col in numerical_columns) {
# get data for the current column
data_col = cs2_conv[[col]]
# run a t-test
t_test_result = t.test(data_col ~ Attrition, data = cs2_conv)
# store t-test result in the list
t_test_results[[col]] = t_test_result
# extract relevant information and add to summary table
summary_table = summary_table %>%
add_row(variable = col,
mean_Attrition_Yes = mean(data_col[cs2_conv$Attrition == "Yes"], na.rm = TRUE),
mean_Attrition_No = mean(data_col[cs2_conv$Attrition == "No"], na.rm = TRUE),
p_value = t_test_result$p.value)
# check if p-value is less than 0.05
if (t_test_result$p.value < 0.05) {
significant_variables[[col]] = t_test_result
}
# check if p-value is less than 0.001
if (t_test_result$p.value < 0.001) {
super_sig_variables[[col]] = t_test_result
}
}
# print summary table
kable(summary_table, caption = "t-test results") %>% kable_styling(full_width = FALSE, position = "left")
| variable | mean_Attrition_Yes | mean_Attrition_No | p_value |
|---|---|---|---|
| Age | 33.785714 | 37.412329 | 0.0000505 |
| DailyRate | 784.292857 | 821.160274 | 0.3188749 |
| DistanceFromHome | 10.957143 | 9.028767 | 0.0164052 |
| EmployeeNumber | 998.371429 | 1035.865753 | 0.4978776 |
| HourlyRate | 67.292857 | 65.291781 | 0.2744798 |
| MonthlyIncome | 4764.785714 | 6702.000000 | 0.0000002 |
| MonthlyRate | 13624.285714 | 14460.123288 | 0.1980950 |
| NumCompaniesWorked | 3.078571 | 2.660274 | 0.0978823 |
| PercentSalaryHike | 15.328571 | 15.175342 | 0.6692297 |
| TotalWorkingYears | 8.185714 | 11.602740 | 0.0000007 |
| TrainingTimesLastYear | 2.650000 | 2.867123 | 0.0594832 |
| YearsAtCompany | 5.192857 | 7.301370 | 0.0002563 |
| YearsInCurrentRole | 2.907143 | 4.453425 | 0.0000015 |
| YearsSinceLastPromotion | 2.135714 | 2.175343 | 0.8983165 |
| YearsWithCurrManager | 2.942857 | 4.369863 | 0.0000051 |
# print list of variables with p-value < 0.05
significant_variable_names = names(significant_variables)
cat("\nVariables with p-value < 0.05:\n", paste(significant_variable_names, collapse = ", "))
Variables with p-value < 0.05:
Age, DistanceFromHome, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
# print list of variables with p-value < 0.001
super_sig_variable_names = names(super_sig_variables)
cat("\nVariables with p-value < 0.001:\n", paste(super_sig_variable_names, collapse = ", "))
Variables with p-value < 0.001:
Age, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
# print full t-test results for variables with p-value < 0.001
cat("\nT-test results for variables with p-value < 0.001:\n")
T-test results for variables with p-value < 0.001:
for (col in super_sig_variable_names) {
cat("Variable:", col, "\n")
print(super_sig_variables[[col]])
cat("\n")
}
Variable: Age
Welch Two Sample t-test
data: data_col by Attrition
t = 4.1509, df = 184.91, p-value = 5.05e-05
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
1.902905 5.350324
sample estimates:
mean in group No mean in group Yes
37.41233 33.78571
Variable: MonthlyIncome
Welch Two Sample t-test
data: data_col by Attrition
t = 5.3249, df = 228.45, p-value = 2.412e-07
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
1220.382 2654.047
sample estimates:
mean in group No mean in group Yes
6702.000 4764.786
Variable: TotalWorkingYears
Welch Two Sample t-test
data: data_col by Attrition
t = 5.1364, df = 201.19, p-value = 6.596e-07
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
2.105259 4.728792
sample estimates:
mean in group No mean in group Yes
11.602740 8.185714
Variable: YearsAtCompany
Welch Two Sample t-test
data: data_col by Attrition
t = 3.7256, df = 191.55, p-value = 0.0002563
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
0.9922099 3.2248155
sample estimates:
mean in group No mean in group Yes
7.301370 5.192857
Variable: YearsInCurrentRole
Welch Two Sample t-test
data: data_col by Attrition
t = 4.9513, df = 208, p-value = 1.522e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
0.9306052 2.1619584
sample estimates:
mean in group No mean in group Yes
4.453425 2.907143
Variable: YearsWithCurrManager
Welch Two Sample t-test
data: data_col by Attrition
t = 4.6826, df = 209.75, p-value = 5.084e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
0.8262439 2.0277679
sample estimates:
mean in group No mean in group Yes
4.369863 2.942857
This provides additional evidence of numerical variables that
likely influence attrition, narrowing the pool of important features in
the data set.
Variables with p-value < 0.05: Age, DistanceFromHome,
MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,
YearsWithCurrManager
(The last 5 are highly correlated as I saw above.)
Variables with p-value < 0.001: Age, MonthlyIncome,
TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,
YearsWithCurrManager
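With 15 simultaneous t-tests, some caution about multiple comparisons is warranted; a Benjamini-Hochberg adjustment is one way to check whether the shortlist survives correction. A sketch, assuming `summary_table` from the loop above:

```r
# adjust the t-test p-values for multiple comparisons (Benjamini-Hochberg FDR)
summary_table$p_adjusted = p.adjust(summary_table$p_value, method = "BH")
# variables still significant at the 0.05 level after adjustment
summary_table$variable[summary_table$p_adjusted < 0.05]
```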
To gather statistical evidence of categorical
variables associated with attrition, I perform chi-square tests for each
categorical feature with the null hypothesis that the proportion of
attrition is the same across all categories (feature levels).
# extract categorical column names from cs2_categorical, excluding Attrition
categorical_columns = colnames(cs2_categorical)
categorical_columns = categorical_columns[categorical_columns != "Attrition"]
# initialize an empty list to store chi-square test results
chisq_test_results = list()
# initialize an empty dataframe to store summary statistics
summary_table2 = data.frame(variable = character(),
p_value = numeric(),
stringsAsFactors = FALSE)
# initialize empty lists to store variable names with p-value < 0.05 or < 0.001
significant_variables2 = list()
super_sig_variables2 = list()
# loop through each categorical variable
for (col in categorical_columns) {
# create a contingency table for the chi-square test
contingency_table = table(cs2_conv[[col]], cs2_conv$Attrition)
# print(contingency_table)
# run a chi-square test
chisq_test_result = chisq.test(contingency_table)
# store chi-square test result in the list
chisq_test_results[[col]] = chisq_test_result
# extract relevant information and add to summary table
summary_table2 = summary_table2 %>%
add_row(variable = col,
p_value = chisq_test_result$p.value)
# check if p-value is less than 0.05
if (chisq_test_result$p.value < 0.05) {
significant_variables2[[col]] = chisq_test_result
}
# check if p-value is less than 0.001
if (chisq_test_result$p.value < 0.001) {
super_sig_variables2[[col]] = chisq_test_result
}
}
# print summary table
kable(summary_table2, caption = "chi-square test results") %>% kable_styling(full_width = FALSE, position = "left")
| variable | p_value |
|---|---|
| BusinessTravel | 0.0499254 |
| Department | 0.0094238 |
| Education | 0.6242838 |
| EducationField | 0.2682198 |
| EnvironmentSatisfaction | 0.0105409 |
| Gender | 0.5151297 |
| JobInvolvement | 0.0000000 |
| JobLevel | 0.0000000 |
| JobRole | 0.0000000 |
| JobSatisfaction | 0.0111512 |
| MaritalStatus | 0.0000000 |
| OverTime | 0.0000000 |
| PerformanceRating | 0.7461706 |
| RelationshipSatisfaction | 0.3727117 |
| StockOptionLevel | 0.0000000 |
| WorkLifeBalance | 0.0024951 |
# print list of variables with p-value < 0.05
significant_variable_names2 = names(significant_variables2)
cat("\nVariables with p-value < 0.05:\n", paste(significant_variable_names2, collapse = ", "))
Variables with p-value < 0.05:
BusinessTravel, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, OverTime, StockOptionLevel, WorkLifeBalance
# print list of variables with p-value < 0.001
super_sig_variable_names2 = names(super_sig_variables2)
cat("\nVariables with p-value < 0.001:\n", paste(super_sig_variable_names2, collapse = ", "))
Variables with p-value < 0.001:
JobInvolvement, JobLevel, JobRole, MaritalStatus, OverTime, StockOptionLevel
# print full chi-square test results for variables with p-value < 0.001
cat("\nChi-square test results for variables with p-value < 0.001:\n")
Chi-square test results for variables with p-value < 0.001:
for (col in super_sig_variable_names2) {
cat("Variable:", col, "\n")
print(super_sig_variables2[[col]])
cat("\n")
}
Variable: JobInvolvement
Pearson's Chi-squared test
data: contingency_table
X-squared = 41.465, df = 3, p-value = 5.211e-09
Variable: JobLevel
Pearson's Chi-squared test
data: contingency_table
X-squared = 41.533, df = 4, p-value = 2.085e-08
Variable: JobRole
Pearson's Chi-squared test
data: contingency_table
X-squared = 60.543, df = 8, p-value = 3.647e-10
Variable: MaritalStatus
Pearson's Chi-squared test
data: contingency_table
X-squared = 34.406, df = 2, p-value = 3.379e-08
Variable: OverTime
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 62.762, df = 1, p-value = 2.333e-15
Variable: StockOptionLevel
Pearson's Chi-squared test
data: contingency_table
X-squared = 56.245, df = 3, p-value = 3.724e-12
Although the assumptions of a chi-square test may be violated
here, and alternative tests might be more appropriate, this at least
provides additional evidence of categorical variables that likely
influence attrition, facilitating a narrower pool of variables to
consider.
Variables with p-value < 0.05: BusinessTravel, Department,
EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole,
JobSatisfaction, MaritalStatus, OverTime, StockOptionLevel,
WorkLifeBalance
Variables with p-value < 0.001: JobInvolvement, JobLevel,
JobRole, MaritalStatus, OverTime, StockOptionLevel
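As noted, chi-square assumptions (e.g., expected cell counts of at least 5) may be violated for sparse levels; Fisher's exact test is a common alternative. A hedged sketch for one variable, using simulated p-values since the table is larger than 2x2:

```r
# Fisher's exact test avoids the chi-square expected-count assumption;
# for tables larger than 2x2, simulate.p.value keeps computation tractable
tab = table(cs2_conv$JobRole, cs2_conv$Attrition)
fisher.test(tab, simulate.p.value = TRUE, B = 10000)
```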
To look at co-variation among numerical variables that
seem likely to influence attrition, I make scatter plots of all
combinations of those with highly significant t-test results.
# look at co-variation among numerical variables
# function to create scatter plots for each pair of variables
make_scatterplot = function(var1, var2) {
cs2_conv %>%
ggplot(aes(x = .data[[var1]], y = .data[[var2]], color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") +
geom_smooth() +
labs(title = paste(var1, "vs", var2))
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names) - 1)) {
var1 = super_sig_variable_names[i]
# plot var1 with every other variable that comes after it in the list
for (j in (i + 1):length(super_sig_variable_names)) {
var2 = super_sig_variable_names[j]
scatter = make_scatterplot(var1, var2)
# display and save the individual plots
print(scatter) # using print() so it's easier to find where I call the plot
# filename = paste0("plots/EDA/scatter_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = scatter, device = "png")
}
}
To do the same for combinations of numerical and categorical
variables, I make boxplots of all combinations of those with highly
significant t-test and chi-square test results.
# look at co-variation between numerical and categorical variables
# function to create boxplots for each pair of variables
make_boxplot = function(catvar, numvar) {
cs2_conv %>% #no group_by() needed; the fill aesthetic handles grouping
ggplot(aes(x = .data[[catvar]], y = .data[[numvar]], fill = Attrition)) +
geom_boxplot() +
coord_flip() +
labs(title = paste(catvar, "vs", numvar))
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names2))) {
catvar = super_sig_variable_names2[i]
# plot catvar with every numvar
for (j in 1:length(super_sig_variable_names)) {
numvar = super_sig_variable_names[j]
boxplt = make_boxplot(catvar, numvar)
# display and save the individual plots
# I tried to display them in a grid in the interest of space, but they are just too small.
print(boxplt)
# filename = paste0("plots/EDA/boxplt_", catvar, "_vs_", numvar, ".png")
# ggsave(filename, plot = boxplt, device = "png")
}
}
To do the same for combinations of categorical variables, I
make bar charts of all combinations of those with highly significant
chi-square test results.
# look at co-variation among categorical variables
# Relationships might be easier to see with a different normalization of the proportions.
# As written: within each attrition group and each level of variable 1, the count of each
# level of variable 2 (the fill variable) is divided by the total count across all levels
# of variable 2 (within that attrition group and variable 1 level).
make_barchart = function(var1, var2) {
cs2_conv %>%
group_by(Attrition, .data[[var1]], .data[[var2]]) %>%
dplyr::summarize(n = n(), .groups = 'drop') %>% #count occurrences within each group
group_by(Attrition, .data[[var1]]) %>% # group by Attrition and var1
mutate(prop = n / sum(n)) %>% #calculate proportion within each group
ggplot(aes(y = .data[[var1]], x = prop, fill = .data[[var2]])) + #plot the proportion on x, filled by var2
geom_bar(stat = "identity", position = position_dodge2(preserve = "single")) +
facet_wrap(~Attrition, labeller = labeller(Attrition = c("No" = "No Attrition", "Yes" = "Yes Attrition"))) +
labs(title = paste(var1, "vs", var2),
y = var1,
x = paste0("Proportion of ", var2, " within Attrition and ", var1, " level"),
fill = var2) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1")
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names2) - 1)) {
var1 = super_sig_variable_names2[i]
for (j in (i + 1):length(super_sig_variable_names2)) {
var2 = super_sig_variable_names2[j]
barplt = make_barchart(var1, var2)
# display and save the individual plots
print(barplt)
filename = paste0("plots/EDA/barplt_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = barplt, device = "png")
}
}
Observations from co-variation plots
Job level
Marital Status
Overtime and Stock Option
Monthly income
Some additional ideas:
As there is high correlation between total working years,
years at company, years in current role, and years with current manager,
I want to see whether deriving one variable from two of these might help
offset the correlation with age and extract more meaningful
information from fewer variables.
cs2_conv2 = cs2_conv %>%
mutate(
PercentWrkYrsAtCompany = ifelse(TotalWorkingYears == 0, 0,
round(YearsAtCompany / TotalWorkingYears * 100, 2)),
PercentYrs_wManager = ifelse(TotalWorkingYears == 0, 0,
round(YearsWithCurrManager / TotalWorkingYears * 100, 2)),
PercentYrs_inRole = ifelse(TotalWorkingYears == 0, 0,
round(YearsInCurrentRole / TotalWorkingYears * 100, 2)),
PercentYrs_inRoleAtComp = ifelse(YearsAtCompany == 0, 0,
round(YearsInCurrentRole / YearsAtCompany * 100, 2))
)
# str(cs2_conv2)
# summary(cs2_conv2)
# plot these derived variables with other variables that seem relevant and alone to see the distributions
cs2_conv2 %>% ggplot(aes(x = Age, y = PercentWrkYrsAtCompany, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = PercentWrkYrsAtCompany, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = Age, y = PercentYrs_inRoleAtComp, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = PercentYrs_inRoleAtComp, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MaritalStatus, y = PercentWrkYrsAtCompany, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = JobLevel, y = PercentWrkYrsAtCompany, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = MaritalStatus, y = PercentYrs_inRoleAtComp, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = JobLevel, y = PercentYrs_inRoleAtComp, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = PercentWrkYrsAtCompany, after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = 10) + facet_wrap(~Attrition)
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = 10) + facet_wrap(~Attrition)
# run t-tests to check for possible differences
t.test(PercentWrkYrsAtCompany ~ Attrition, data = cs2_conv2)
Welch Two Sample t-test
data: PercentWrkYrsAtCompany by Attrition
t = 0.94099, df = 187.08, p-value = 0.3479
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-3.316668 9.366557
sample estimates:
mean in group No mean in group Yes
68.70130 65.67636
t.test(PercentYrs_wManager ~ Attrition, data = cs2_conv2)
Welch Two Sample t-test
data: PercentYrs_wManager by Attrition
t = 3.2429, df = 188.9, p-value = 0.001399
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
3.638589 14.938990
sample estimates:
mean in group No mean in group Yes
41.57358 32.28479
t.test(PercentYrs_inRole ~ Attrition, data = cs2_conv2)
Welch Two Sample t-test
data: PercentYrs_inRole by Attrition
t = 3.058, df = 189.5, p-value = 0.00255
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
3.079205 14.270408
sample estimates:
mean in group No mean in group Yes
42.37452 33.69971
t.test(PercentYrs_inRoleAtComp ~ Attrition, data = cs2_conv2)
Welch Two Sample t-test
data: PercentYrs_inRoleAtComp by Attrition
t = 3.55, df = 177.46, p-value = 0.0004933
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
5.468542 19.158856
sample estimates:
mean in group No mean in group Yes
59.0657 46.7520
Based on the plots and t-test results, percent years in role at
company (years in current role divided by years at company) appears to
be the most potentially useful of these derived variables.
I further explore its usefulness by plotting combinations of percent
years in role at company, monthly income, age, and job level.
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = MonthlyIncome, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = Age, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = Age, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
Monthly income is a good predictor of job level (and vice versa)
regardless of the percent of years in role at the company. Age is a
weaker predictor, though some relationship remains. The distribution of
ages widens as income decreases, particularly below a monthly income of
5,000.
Job level seems like a powerful predictor.
So far, the visual and statistical evidence suggests that the top
features leading to attrition are Job Level, Marital Status, Monthly
Income, and Overtime. To confirm this and narrow the list to three, I
take the 12 features that show the greatest differences between
attrition categories and generate every unique combination of 3 of
them. The loop in this code builds a Naive Bayes model from each
combination to predict attrition and ranks the combinations by how well
they performed.
modelIdx = c(2,15,16,17,19,20,24,29,30,33,34,36) #indexes of predicted top features
# these are the indexes for Age, JobInvolvement, JobLevel, JobRole, MaritalStatus, MonthlyIncome, OverTime, StockOptionLevel, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
# initialize an empty data frame to store results
results = data.frame(Index1 = integer(), Index2 = integer(), Index3 = integer(), Sensitivity = numeric(), Specificity = numeric(), Accuracy = numeric(), stringsAsFactors = FALSE)
# number of different splits
iters = 50
# 70-30 train/test split
splitPerc = 0.7
# generate all combinations of 3 indexes
index_combinations = combinat::combn(modelIdx, 3)
# loop through each combination of 3 indexes
for (i in 1:ncol(index_combinations)) {
index_set = index_combinations[, i]
# create holders for the sensitivity, specificity, and accuracy for each iteration
masterSens = numeric(iters)
masterSpec = numeric(iters)
masterAcc = numeric(iters)
# iterate over the specified number of iterations
for (j in 1:iters) {
set.seed(j)
# sample 70% for training
trainIdx = sample(1:dim(cs2_conv2)[1], round(splitPerc * dim(cs2_conv2)[1]))
# select the rows for training
AttritionTrn = cs2_conv2[trainIdx, ]
# select the remaining rows for testing
AttritionTst = cs2_conv2[-trainIdx, ]
# temporarily set the current indexes for modeling
temp_modelIdx = index_set
# train the model
modelNB = naiveBayes(AttritionTrn[, temp_modelIdx], AttritionTrn$Attrition, laplace = 1)
# predict using the model
preds = predict(modelNB, AttritionTst[, temp_modelIdx])
# make a confusion matrix
CM = confusionMatrix(table(preds, AttritionTst$Attrition))
# store sensitivity, specificity, and accuracy
masterSens[j] = CM$byClass["Sensitivity"]
masterSpec[j] = CM$byClass["Specificity"]
masterAcc[j] = CM$overall["Accuracy"]
}
# calculate average sensitivity, specificity, and accuracy
avg_sensitivity = mean(masterSens)
avg_specificity = mean(masterSpec)
avg_accuracy = mean(masterAcc)
# store the results including the three indexes
results = rbind(results, list(Index1 = index_set[1], Index2 = index_set[2], Index3 = index_set[3], Sensitivity = avg_sensitivity, Specificity = avg_specificity, Accuracy = avg_accuracy))
}
# output the results
results = results %>% arrange(-Specificity)
kable(head(results, 10), caption = "3 variable combos to predict attrition") %>% kable_styling(full_width = FALSE, position = "left")
| Index1 | Index2 | Index3 | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|---|
| 16 | 20 | 24 | 0.9409578 | 0.3656826 | 0.8475096 |
| 20 | 24 | 29 | 0.9571661 | 0.3050640 | 0.8511877 |
| 19 | 24 | 29 | 0.9470747 | 0.3003832 | 0.8422222 |
| 16 | 24 | 29 | 0.9722697 | 0.2859146 | 0.8607663 |
| 24 | 29 | 34 | 0.9593601 | 0.2833596 | 0.8495019 |
| 16 | 24 | 34 | 0.9585989 | 0.2794801 | 0.8483525 |
| 16 | 24 | 30 | 0.9544960 | 0.2769441 | 0.8440613 |
| 16 | 19 | 29 | 0.9107722 | 0.2750041 | 0.8068966 |
| 16 | 24 | 36 | 0.9586844 | 0.2522911 | 0.8435249 |
| 24 | 29 | 36 | 0.9626306 | 0.2439620 | 0.8453640 |
The top three features that predict attrition are within the
group that I initially highlighted. They are Job Level, Monthly Income
and Overtime. Stock Option Level and Marital Status are also strong
predictors.
Obtaining marital status data may pose challenges due to potential
privacy considerations, unlike the other predictors which are likely
readily available through employment records.
This section provides an overview of the first step of my analysis
process, focusing on visual evidence.
I started the analysis by plotting each variable, separated by
attrition group, to compare the populations. Visually, many features on
this list seemed to show differences between the populations, but the
four starred variables stood out as particularly important.
Important features predicted by visual evidence:
Age
Business Travel
Distance from home
Department
Education Field
Environmental Satisfaction
Job Involvement
Job Level *
Job Role
Job Satisfaction
Marital Status *
Monthly Income *
Number Companies Worked For
Overtime *
Relationship Satisfaction
Stock Option Level
Total Working Years
Work Life Balance
Years At Company
Years In Current Role
Years With Current Manager
To illustrate positive and negative visual evidence, I plot
attrition by job level and by gender.
# plot positive predictors example: Job Level for presentation
# also plot negative predictor: Gender
# I revert to using my original converted dataset `cs2_conv`.
# plot positive predictor example: job level and attrition
# calculation of proportion of employees: employees in job level attrition yes / total employees in job level, to normalize each job level
jobLevel_bar = cs2_conv %>%
group_by(JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
title = "Proportion of Attrition within Each Job Level",
caption = "Normalized by Total Employees per Job Level") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(jobLevel_bar)
# ggsave(jobLevel_bar, filename = "plots/top3_jobLevel_bar.png")
# plot negative predictor example: gender and attrition
genderBar = cs2_conv %>%
group_by(Gender, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = Gender, x = prop)) +
geom_bar(stat = "identity", fill = "#377EB8") + #fill color = Attrition group blue
labs(y = "Gender", x = "Proportion of Attrition",
fill = "Gender",
title = "Proportion of Attrition within Each Gender",
caption = "Normalized by Total Employees per Gender") +
theme_bw() +
theme(text = element_text(size = 12))
genderBar
# ggsave(genderBar, filename = "plots/top3negExample_genderBar.png")
Job level 1 has a much higher attrition rate than any other job
level: over a quarter of its employees leave the company, at least 13
percentage points more than any other job level. Also of note, only 5%
of employees in job level 4 leave.
On the other hand, gender does not appear to contribute to attrition.
There is very little difference in attrition rates between males and
females (17% and 15%).
This section provides an overview of the remaining steps of my
analysis process, focusing on statistical evidence and modeling.
I tested for statistically significant differences between the
attrition groups for each variable highlighted by the visual evidence.
For the numerical variables, I used a t-test of the null hypothesis
that the attrition-group means are equal. For the categorical
variables, I used a chi-square test of the null hypothesis that the
proportion of attrition is the same across the categories of the
variable.
These variables, including the four starred previously, showed
significant differences at the alpha = 0.05 level.
Age
Job Involvement
Job Level *
Job Role
Marital Status *
Monthly Income *
Overtime *
Stock Option Level
Total Working Years
Years At Company
Years In Current Role
Years With Current Manager
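The chi-square step above can be sketched on a small invented contingency table (the counts below are made up for illustration and are not from the data set):

```r
# invented counts: rows = OverTime group, cols = Attrition
toy = matrix(c(300, 40,   # no overtime: stayed, left
               150, 60),  # overtime:    stayed, left
             nrow = 2, byrow = TRUE,
             dimnames = list(OverTime = c("No", "Yes"),
                             Attrition = c("No", "Yes")))
# chisq.test() on a matrix tests whether the attrition proportions differ across rows
res = chisq.test(toy)
res$p.value  # a small p-value is evidence that the proportions differ
```

In the analysis itself the input would be something like `table(cs2_conv$OverTime, cs2_conv$Attrition)` rather than this toy matrix.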
From this list, I identified the subset of three variables that
achieved the highest specificity when used in a predictive model for
attrition. In this model, specificity refers to the percentage of
employees who actually left that are correctly predicted as
attrition.
The three top factors that lead to attrition are Job Level, Monthly
Income, and Overtime.
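For reference, `confusionMatrix()` from caret treats the first factor level, "No", as the positive class here, so sensitivity tracks correctly predicted stayers and specificity tracks correctly predicted leavers. A minimal base-R sketch with invented labels:

```r
# invented labels: 8 stayers ("No") and 4 leavers ("Yes")
truth = factor(c(rep("No", 8), rep("Yes", 4)), levels = c("No", "Yes"))
preds = factor(c(rep("No", 7), "Yes",   # 7 of 8 stayers predicted correctly
                 rep("Yes", 3), "No"),  # 3 of 4 leavers predicted correctly
               levels = c("No", "Yes"))
tab = table(preds, truth)                    # rows = predicted, cols = true
sens = tab["No", "No"] / sum(tab[, "No"])    # sensitivity = 7/8
spec = tab["Yes", "Yes"] / sum(tab[, "Yes"]) # specificity = 3/4
```

These hand-computed values match what `CM$byClass["Sensitivity"]` and `CM$byClass["Specificity"]` would report for the same labels.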
# plot the other positive predictors for presentation: Monthly Income, Job Level and Overtime
# plot monthly income and attrition with a boxplot and two histograms
incomeBox = cs2_conv %>%
ggplot(aes(x = MonthlyIncome, y = Attrition, fill = Attrition)) +
geom_boxplot() +
ylab("Attrition") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
scale_fill_brewer(palette = "Set1")
yAtt_hist = cs2_conv %>% filter(Attrition == "Yes") %>%
ggplot(aes(x = MonthlyIncome)) +
geom_histogram(binwidth = 1000, fill = "#377EB8") +
labs(title = "Distribution of Monthly Income by Employee Attrition",
y = "Yes") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_blank())
nAtt_hist = cs2_conv %>% filter(Attrition == "No") %>%
ggplot(aes(x = MonthlyIncome)) +
geom_histogram(binwidth = 1000, fill = "#E41A1C") +
labs(#title = "Employee Attrition and Monthly Income",
y = "No") +
theme_bw() +
theme(legend.position = "none")
incomePlt = yAtt_hist / incomeBox / nAtt_hist
print(incomePlt)
# ggsave(incomePlt, filename = "plots/top3_incomeDist.png")
# make co-variation plots split by attrition groups
# plot positive predictor: job level and monthly income
income_jobLevel_box = ggplot(cs2_conv, aes(x = JobLevel, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
labs(x = "Job Level", y = "Monthly Income", fill = "Attrition") +
ggtitle("Monthly Income by Job Level and Attrition") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
income_jobLevel_box
# ggsave(income_jobLevel_box, filename = "plots/top3_income_jobLevel_box.png")
# plot positive predictor: job level and attrition, with NO overtime
# employees NOT working overtime
jobLevel_OTno_bar = cs2_conv %>%
group_by(OverTime, JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes" & OverTime == "No") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
subtitle = "Proportion of Attrition within Each Job Level",
title = "Employees with No Overtime Hours",
caption = "Normalized by Total Employees per Job Level") +
xlim(0.00, 0.55) +
theme_bw() +
theme(text = element_text(size = 12)) +
scale_fill_brewer(palette = "Set1")
print(jobLevel_OTno_bar)
# ggsave(jobLevel_OTno_bar, filename = "plots/top3_jobLevel_OTno_bar.png")
# plot positive predictor: job level and attrition, with overtime
# employees working at least an hour of overtime
jobLevel_OTyes_bar = cs2_conv %>%
group_by(OverTime, JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes" & OverTime == "Yes") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
subtitle = "Proportion of Attrition within Each Job Level",
title = "Employees Working Overtime",
caption = "Normalized by Total Employees per Job Level") +
xlim(0.00, 0.55) +
theme_bw() +
theme(text = element_text(size = 12)) +
scale_fill_brewer(palette = "Set1")
print(jobLevel_OTyes_bar)
# ggsave(jobLevel_OTyes_bar, filename = "plots/top3_jobLevel_OTyes_bar.png")
# find 25th, 50th and 75th percentile of monthly incomes for each attrition group and job level
# is there a stay/go cut off for job level 4
mid50_income = cs2_conv %>%
group_by(Attrition, JobLevel) %>%
dplyr::summarise(q25 = quantile(MonthlyIncome, 0.25),
median = median(MonthlyIncome),
q75 = quantile(MonthlyIncome, 0.75))
kable(mid50_income, caption = "monthly income q25, median, q75 for each job level and attrition group") %>% kable_styling(full_width = FALSE, position = "left")
| Attrition | JobLevel | q25 | median | q75 |
|---|---|---|---|---|
| No | 1 | 2293.00 | 2695.0 | 3207.0 |
| No | 2 | 4632.25 | 5365.5 | 6265.0 |
| No | 3 | 8624.50 | 9985.0 | 10911.5 |
| No | 4 | 13770.00 | 15992.0 | 16799.0 |
| No | 5 | 18844.00 | 19189.0 | 19636.0 |
| Yes | 1 | 2152.50 | 2424.5 | 2860.0 |
| Yes | 2 | 4712.25 | 5325.0 | 6162.5 |
| Yes | 3 | 7978.00 | 9582.0 | 10325.0 |
| Yes | 4 | 12552.50 | 12936.0 | 13065.0 |
| Yes | 5 | 19364.75 | 19695.0 | 19848.5 |
The plot of monthly income distributions reveals a narrower income
range and a lower median income among employees who leave (in blue).
The high-income outliers are likely due to factors like retirement.
The boxplot of job level and income illustrates the positive
relationship between these variables. It reveals a large income gap in
job level 4 between employees who stay and those who leave. Job levels
1 and 3 show a smaller trend of attrition associated with lower median
incomes.
Attrition rates are low for all job levels in the graph of
employees that do not work overtime, under 15% for job level 1 and under
10% for the others. However, the story changes for employees working at
least an hour of overtime. Over 50% of employees in job level 1 working
overtime choose to leave. This pattern holds true across all job levels
except for job level 4, with attrition rates more than doubling among
overtime workers.
# extra plots for the appendix
# plot positive predictor example: overtime and attrition
# calculation of proportion of employees: employees in overtime group & attrition yes / total employees in overtime group, to normalize each overtime group
overtime_bar = cs2_conv %>%
group_by(OverTime, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = OverTime, x = prop)) +
geom_bar(stat = "identity", fill = "#377EB8") +
labs(y = "Overtime", x = "Proportion of Attrition",
fill = "Overtime",
title = "Proportion of Attrition for Employees With and Without Overtime",
caption = "Normalized by Total Employees per OT Group") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(overtime_bar)
# ggsave(overtime_bar, filename = "plots/appendix/top3_overtime_bar.png")
# plot positive predictor: monthly income and overtime
income_OT_box = ggplot(cs2_conv, aes(x = OverTime, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
labs(x = "Overtime", y = "Monthly Income", fill = "Attrition") +
ggtitle("Monthly Income by Overtime and Attrition") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(income_OT_box)
# ggsave(income_OT_box, filename = "plots/appendix/top3_income_OT_box.png")
The attrition rate for overtime workers is more than three times
that of employees that don’t work any overtime.
There is a larger income gap between attrition groups in employees
working overtime, indicating a possibility that lower income and
overtime make attrition more likely.
Top three factors leading to attrition: summary of
trends, takeaways, and recommendations
Income
There is a large income gap between employees who stay and those who
leave in job level 4: those who leave have a median income just under
13,000, while those who stay have a median income of almost 16,000. Job
levels 1 and 3 show a smaller trend of attrition associated with lower
median incomes. Overall, the median income of employees who leave is
lower than that of those who stay.
Job Level
Over a quarter of the employees in job level 1 leave the company, a much
higher attrition rate than any other job level.
Overtime
Over 50% of employees in job level 1 working overtime choose to leave.
This pattern holds true across all job levels except for job level 4,
with attrition rates more than doubling among overtime workers.
Recommendations
For job level 4, increase monthly pay where it is below 14,000. For
all other job levels, reduce overtime. For job level 1, increase pay
and promote employees to job level 2.
For each numerical variable, I plot the distributions across job
roles in side-by-side box plots. For each categorical variable, I use
bar plots to show the proportion of its levels within each job
role.
# reorder levels of JobRole from low income/level to high
cs2_conv$JobRole = factor(cs2_conv$JobRole, levels = c("Sales Representative", "Laboratory Technician", "Human Resources", "Research Scientist", "Sales Executive", "Healthcare Representative", "Manufacturing Director", "Research Director", "Manager"))
# rename numerical and categorical variables for nicer axis titles
custom_labels = list(
JobLevel = "Job Level",
JobRole = "Job Role",
MaritalStatus = "Marital Status",
MonthlyIncome = "Monthly Income",
OverTime = "Overtime",
StockOptionLevel = "Stock Option Level",
TotalWorkingYears = "Total Working Years",
YearsAtCompany = "Years At Company",
YearsInCurrentRole = "Years in Current Role",
YearsWithCurrManager = "Years with Current Manager"
)
# a function to get the appropriate label from the list above
get_label = function(var) {
if (var %in% names(custom_labels)) {
return(custom_labels[[var]])
} else {
return(var)
}
}
# numerical_columns = colnames(cs2_numerical)
for (numvar in numerical_columns) {
boxplot = cs2_conv %>%
ggplot(aes(x = .data[[numvar]], y = JobRole, fill = JobRole)) +
geom_boxplot() +
labs(title = paste("Job Role vs", get_label(numvar)),
x = get_label(numvar), y = "Job Role") +
theme_bw() +
theme(legend.position = "none",
text = element_text(size = 12)) +
scale_fill_brewer(palette = "Spectral")
# display each plot
print(boxplot)
# create a unique filename for each plot
filename = paste0("plots/JobRole/boxplt_JobRole_vs_", numvar, ".png")
# save the plots
# ggsave(filename, plot = boxplot, device = "png")
}
# categorical_columns = colnames(cs2_categorical)
cat_cols = categorical_columns[!categorical_columns %in% "JobRole"]
for (catvar in cat_cols) {
barplot = cs2_conv %>%
group_by(JobRole, .data[[catvar]], .drop = FALSE) %>%
dplyr::summarize(count = n(), .groups = 'drop') %>%
group_by(JobRole) %>%
mutate(prop = count / sum(count)) %>%
ggplot(aes(y = JobRole, x = prop, fill = .data[[catvar]])) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = paste("Job Role vs", get_label(catvar)),
y = "Job Role", x = "Proportion", fill = get_label(catvar)) +
theme_bw() +
theme(legend.position = "right",
text = element_text(size = 12)) +
scale_fill_brewer(palette = "Set1")
#display each plot
print(barplot)
# create a unique filename for each plot
filename = paste0("plots/JobRole/barplot_JobRole_vs_", catvar, ".png")
# save the plots
# ggsave(filename, plot = barplot, device = "png")
}
Trends in Job Roles
Job roles seem to segregate into three groups by monthly income and
related factors such as job level, years in current role, years with
current manager, years at company, total working years, and, to a
lesser extent, age. I therefore create a new median income level
variable and group the roles into low, middle, and high levels. I then
use t-tests to check whether the differences in mean income between
adjacent levels (low vs. mid and mid vs. high) are significant. I also
plot the derived variables percent work years at company and percent
years in role at company against job role and run t-tests to check for
differences in these mean percentages between the grouped income
levels.
# reorder levels of JobRole
cs2_conv2$JobRole = factor(cs2_conv2$JobRole, levels = c("Sales Representative", "Laboratory Technician", "Human Resources", "Research Scientist", "Sales Executive", "Healthcare Representative", "Manufacturing Director", "Research Director", "Manager"))
# group roles by median income level (low, med, high)
# test difference in salary between low and mid and mid and high
cs2_conv2 = cs2_conv2 %>%
mutate(incomeGroup = case_when(
JobRole %in% c("Sales Representative", "Research Scientist", "Human Resources", "Laboratory Technician") ~ "Low",
JobRole %in% c("Sales Executive", "Manufacturing Director", "Healthcare Representative") ~ "Medium",
TRUE ~ "High"
)
) %>%
mutate(incomeGroup = factor(incomeGroup, levels = c("Low", "Medium", "High")))
# check the conversion
# head(cs2_conv2)
# filter the data
cs2_LowMid = cs2_conv2 %>% filter(incomeGroup != "High")
cs2_MidHigh = cs2_conv2 %>% filter(incomeGroup != "Low")
# perform t-tests
t_test_LowMid = t.test(MonthlyIncome ~ incomeGroup, data = cs2_LowMid)
t_test_MidHigh = t.test(MonthlyIncome ~ incomeGroup, data = cs2_MidHigh)
# print the results
t_test_LowMid
Welch Two Sample t-test
data: MonthlyIncome by incomeGroup
t = -27.646, df = 522.61, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Low and group Medium is not equal to 0
95 percent confidence interval:
-4268.293 -3701.937
sample estimates:
mean in group Low mean in group Medium
3167.491 7152.606
t_test_MidHigh
Welch Two Sample t-test
data: MonthlyIncome by incomeGroup
t = -31.529, df = 152.77, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Medium and group High is not equal to 0
95 percent confidence interval:
-9904.980 -8736.867
sample estimates:
mean in group Medium mean in group High
7152.606 16473.529
Unsurprisingly, there is overwhelming evidence that mean incomes
between income groups are significantly different.
For follow-up
It would be interesting to follow up by plotting the three income
groups against income or other features of interest, split by attrition
group. It would also be worthwhile to investigate trends in income, job
level, and overtime across the attrition groups within the different
job roles.
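As a sketch of the first follow-up idea, boxplots of income by the derived income groups, split by attrition, could look like the following (built here on an invented miniature data frame standing in for `cs2_conv2`):

```r
library(ggplot2)
# invented stand-in rows; with the real data, replace `toy` with cs2_conv2
toy = data.frame(
  incomeGroup = factor(rep(c("Low", "Medium", "High"), each = 4),
                       levels = c("Low", "Medium", "High")),
  MonthlyIncome = c(2500, 3100, 2800, 2600, 7000, 6500, 7400, 6900,
                    16000, 17500, 15800, 19000),
  Attrition = factor(rep(c("No", "Yes"), times = 6))
)
followup_box = ggplot(toy, aes(x = incomeGroup, y = MonthlyIncome, fill = Attrition)) +
  geom_boxplot(position = "dodge") +
  labs(x = "Income Group", y = "Monthly Income", fill = "Attrition") +
  theme_bw() +
  scale_fill_brewer(palette = "Set1")
print(followup_box)
```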
(with a minimum of 60% sensitivity and 60% specificity)
I start with a Naive Bayes model to predict attrition.
There are 33 possible explanatory features in the data set. I choose to
start with the 18 that earlier analyses suggest may differ
significantly between the attrition groups. The goal of this block of
code is to build an externally validated model over 100 different
train/test splits using those 18 features. Then, guided by the earlier
analyses, I remove one feature at a time until the specificity of the
model is maximized.
# Naive Bayes classifier
# start with lots of explanatory variables and distill down manually one at a time using domain knowledge
# str(cs2_conv2) #this is the dataframe with my derived variables too
# list indexes of explanatory variables
# for reference:
# modelIdx = c(2,4:9,11:19,21,22,25:27,29:40) #list of all possible features
# modelIdx = c(16,19,20,24,29) #top 5 features
# start here: list of indexes of 18 significant variables (alpha = 0.05) from t-tests and chi-square tests
# modelIdx = c(2,4,6,7,12,15,16,17,18,19,20,24,29,30,32,33,34,36) #significant features: sens 85.6%, spec 60.7%
# end here: list of indexes which produced highest specificity, derived manually
modelIdx = c(12,15,16,17,19,20,24,29,30,32,33,34,36) #sensitivity: 84.4% specificity: 62.3%
# other promising combinations:
# modelIdx = c(2,12,15,16,17,19,20,24,29,30,32,33,34,36) #84.2, 61.4
# modelIdx = c(12,15,16,17,20,24,29,30,32,33,34,36) #84.6, 61.3
# modelIdx = c(15,16,17,20,24,29,30,32,33,34,36) #84, 60.5
# list the explanatory variables used
names(cs2_conv2[,modelIdx])
[1] "EnvironmentSatisfaction" "JobInvolvement"
[3] "JobLevel" "JobRole"
[5] "MaritalStatus" "MonthlyIncome"
[7] "OverTime" "StockOptionLevel"
[9] "TotalWorkingYears" "WorkLifeBalance"
[11] "YearsAtCompany" "YearsInCurrentRole"
[13] "YearsWithCurrManager"
# how many explanatory variables are used
length(modelIdx)
[1] 13
# 100 different splits
iters = 100
# make holders for the stats from each iteration
masterAccu = matrix(nrow = iters, ncol = 2)
masterPval = matrix(nrow = iters, ncol = 2)
masterSens = matrix(nrow = iters, ncol = 2)
masterSpec = matrix(nrow = iters, ncol = 2)
# 70-30 train/test split
splitPerc = .7
for(i in 1:iters)
{
set.seed(i)
# sample 70%
trainIdx = sample(1:dim(cs2_conv2)[1], round(splitPerc*dim(cs2_conv2)[1]))
# choose the rows that match those sampled numbers for training
AttritionTrn = (cs2_conv2[trainIdx,])
# and the others for testing
AttritionTst = (cs2_conv2[-trainIdx,])
# head(AttritionTrn)
# head(AttritionTst)
# give the model the training variables, training labels
modelNB = naiveBayes(AttritionTrn[,modelIdx], AttritionTrn$Attrition, laplace = 1)
# use the model and testing explanatory variables (modelIdx) to predict the testing labels
preds = predict(modelNB, AttritionTst[,modelIdx])
# make a confusion matrix comparing the predicted labels to the true labels
# predicted labels = rows, true labels = cols
CM = confusionMatrix(table(preds, AttritionTst$Attrition))
masterAccu[i,] = c(i, CM$overall["Accuracy"])
masterPval[i,] = c(i, CM$overall["AccuracyPValue"])
masterSens[i,] = c(i, CM$byClass["Sensitivity"])
masterSpec[i,] = c(i, CM$byClass["Specificity"])
}
# organize the output data
colnames(masterAccu) = c("Seed", "Accuracy")
colnames(masterPval) = c("Seed", "AccuracyPValue")
colnames(masterSens) = c("Seed", "Sensitivity")
colnames(masterSpec) = c("Seed", "Specificity")
NB_results = merge(as.data.frame(masterAccu), as.data.frame(masterPval), by = "Seed", all = TRUE)
NB_results = merge(NB_results, as.data.frame(masterSens), by = "Seed", all = TRUE)
NB_results = merge(NB_results, as.data.frame(masterSpec), by = "Seed", all = TRUE)
NB_stats = colMeans(NB_results[,2:5])
NB_stats = data.frame(
Metric = c("Accuracy", "AccuracyPValue", "Sensitivity", "Specificity"),
Value = c(sprintf("%.1f%%", NB_stats[1]*100),
sprintf("%.2e", NB_stats[2]),
sprintf("%.1f%%", NB_stats[3]*100),
sprintf("%.1f%%", NB_stats[4]*100)))
# output the metrics of the naive bayes model
cat("Naive Bayes \n")
Naive Bayes
cat("Accuracy:", NB_stats[1,2], "\n")
Accuracy: 80.9%
cat("Accuracy P-Value:", NB_stats[2,2], "\n")
Accuracy P-Value: 7.83e-01
cat("Sensitivity:", NB_stats[3,2], "\n")
Sensitivity: 84.4%
cat("Specificity:", NB_stats[4,2], "\n")
Specificity: 62.3%
I start with these 18 features: Age, BusinessTravel, Department,
DistanceFromHome, EnvironmentSatisfaction, JobInvolvement, JobLevel,
JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, OverTime,
StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany,
YearsInCurrentRole, YearsWithCurrManager.
Using the method described above, I achieve the highest specificity,
62.3%, with a combined sensitivity plus specificity of 146.7 percentage
points, using 13 features: EnvironmentSatisfaction, JobInvolvement,
JobLevel, JobRole, MaritalStatus, MonthlyIncome, OverTime,
StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany,
YearsInCurrentRole, YearsWithCurrManager.
Then I attempt to validate my predicted best features
for the model with help from ChatGPT. I asked it to help me write a
nested loop that finds every combination of a given number of features,
ranks the combinations by highest specificity, and returns the top ten.
Each pass of the outer loop increases the number of features by one,
breaking out of the loop when the specificity stops increasing. This is
computationally heavy, and because the loop can break early, the
results list may not contain entries for every feature count, so the
output step needs to account for that.
# perhaps it would be more efficient to do the train test splits on the outer loop?
# also I should place the progress reporter (Sys.time()) in a different place in the code
# Function to calculate sensitivity, specificity for all combinations of a given number of variables
calculate_combinations <- function(modelIdx, max_vars = 18, iters = 10, splitPerc = 0.7) {
# Initialize an empty list to store results
results_list <- list()
# Initialize a variable to keep track of the maximum specificity reached
max_specificity_so_far <- 0
# Initialize a variable to store the maximum mean specificity from previous iterations
max_mean_specificity_previous <- 0
# Loop over different number of variables
for (num_vars in seq(8, max_vars)) {
cat("Number of variables:", num_vars, "\n")
start_time <- Sys.time()
# Generate all combinations of variables
index_combinations <- combinat::combn(modelIdx, num_vars)
# Initialize an empty list to store results for this iteration
temp_results_list <- list()
# Loop through each combination of variables
for (i in 1:ncol(index_combinations)) {
index_set <- index_combinations[, i]
# Create holders for the sensitivity, specificity for each iteration
masterSens <- numeric(iters)
masterSpec <- numeric(iters)
# Iterate over the specified number of iterations
for (j in 1:iters) {
set.seed(j)
# Sample 70% for training
trainIdx <- sample(1:dim(cs2_conv2)[1], round(splitPerc * dim(cs2_conv2)[1]))
# Select the rows for training
AttritionTrn <- cs2_conv2[trainIdx, ]
# Select the remaining rows for testing
AttritionTst <- cs2_conv2[-trainIdx, ]
# Temporarily set the current indexes for modeling
temp_modelIdx <- index_set
# Train the model
modelNB <- naiveBayes(AttritionTrn[, temp_modelIdx], AttritionTrn$Attrition, laplace = 1)
# Predict using the model
preds <- predict(modelNB, AttritionTst[, temp_modelIdx])
# Confusion matrix
CM <- confusionMatrix(table(preds, AttritionTst$Attrition))
# Store sensitivity, specificity
masterSens[j] <- CM$byClass["Sensitivity"]
masterSpec[j] <- CM$byClass["Specificity"]
}
# Average sensitivity, specificity
avg_sensitivity <- mean(masterSens)
avg_specificity <- mean(masterSpec)
# Store the results including the combination of variables
temp_results_list[[i]] <- list(variables = index_set, Sensitivity = avg_sensitivity, Specificity = avg_specificity)
}
end_time <- Sys.time()
time_taken <- end_time - start_time
cat("Time taken:", format(time_taken), "\n") # format() preserves the time units of the difftime
# Store the results for this iteration
results_list[[num_vars]] <- temp_results_list
# Get the maximum mean specificity for this number of variables
max_mean_specificity_current <- max(sapply(temp_results_list, function(x) x$Specificity))
# Check if the maximum mean specificity for the current number of variables is less than the previous maximum
if (max_mean_specificity_current < max_mean_specificity_previous) {
cat("Maximum specificity reached for", num_vars, "variables is less than previous maximum. Stopping further iterations.\n")
break
}
# Update the maximum mean specificity from previous iterations
max_mean_specificity_previous <- max_mean_specificity_current
}
return(results_list)
}
# Define the list of indexes to loop through
modelIdx <- c(2,4,6,7,12,15,16,17,18,19,20,24,29,30,32,33,34,36) #significant features
# Define the maximum number of variables
max_vars <- 18
# Call the function to calculate combinations
results <- calculate_combinations(modelIdx, max_vars)
# Determine the maximum number of rows needed: up to 10 combinations kept per number of variables (8 through max_vars)
max_combinations <- (max_vars - 7) * 10
# Preallocate the output data frame with maximum expected number of rows
output_df <- data.frame(Number_of_Variables = integer(max_combinations),
Variable_Indexes = character(max_combinations),
Mean_Sensitivity = numeric(max_combinations),
Mean_Specificity = numeric(max_combinations),
Sum_of_Sensitivity_and_Specificity = numeric(max_combinations),
stringsAsFactors = FALSE)
# Loop over different number of variables
for (num_vars in seq(8, max_vars)) {
cat("\nTop 10 combinations with the highest specificity for", num_vars, "variables:\n")
sorted_indices <- order(sapply(results[[num_vars]], function(x) x$Specificity), decreasing = TRUE)
for (i in 1:min(10, length(sorted_indices))) {
index <- sorted_indices[i]
cat("Combination", i, ": Variables =", results[[num_vars]][[index]]$variables, ", Specificity =", results[[num_vars]][[index]]$Specificity, "\n")
# Assign the results to the preallocated rows in the output data frame
row_index <- (num_vars - 8) * 10 + i # Calculate the row index (num_vars starts at 8)
output_df$Number_of_Variables[row_index] <- num_vars
output_df$Variable_Indexes[row_index] <- paste(results[[num_vars]][[index]]$variables, collapse = ", ")
output_df$Mean_Sensitivity[row_index] <- results[[num_vars]][[index]]$Sensitivity
output_df$Mean_Specificity[row_index] <- results[[num_vars]][[index]]$Specificity
output_df$Sum_of_Sensitivity_and_Specificity[row_index] <- results[[num_vars]][[index]]$Sensitivity + results[[num_vars]][[index]]$Specificity
}
}
# Print the final output data frame
print(output_df)
This is useful for a few reasons. It confirms that the 13
variables I chose,
modelIdx = c(12,15,16,17,19,20,24,29,30,32,33,34,36), yield
the highest specificity (61.7%) among all combinations of 13 variables.
It also finds higher specificity with larger sets: 62.6% with 14
variables (the 13 above plus Department, index 6) and 62.2%
with 15 variables (the 13 above plus BusinessTravel and
DistanceFromHome, indexes 4 and 7).
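To keep these index lists readable, the column positions can be mapped back to variable names directly; a quick check, assuming cs2_conv as created above:

```r
# translate model column indexes into variable names
names(cs2_conv)[c(12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36)]
```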
I want to use several of the best combinations of explanatory
variables to tune the Laplace smoothing parameter. I also want to see
whether I can tune the classification threshold, because the dataset is
imbalanced, with only 16% “Yes” in the Attrition variable. I evaluate
every combination of variables and tuning parameters by averaging the
external and internal validation results.
# computationally expensive, not sure how to do it a better way
# define parameters (the ones I really used)
# modelIdx_list = list(
# c(2, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36), # Combination 1
# c(12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36), # Combination 2
# c(6, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36), # Combination 3
# c(4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36), # Combination 4
# c(4, 6, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36) # Combination 5
# )
# alphas = seq(0.1, 1, 0.1) #smoothing parameters to try
# thresholds = seq(0.1, 0.6, 0.02) #classifier thresholds to try for "Yes" based on ~0.16 proportion yes to no
# iters = 100 #train/test split iterations
# splitPerc = 0.7 #percentage of data for training
# define shortened parameters for demo purposes
modelIdx_list = list(
c(12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36), # Combination 2 (my hand-picked variables)
c(4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36) # Combination 4 (the *best*)
)
alphas = seq(0.1, 0.9, 0.2) #smoothing parameters to try
thresholds = seq(0.3, 0.5, 0.05) #classifier thresholds to try for "Yes" based on ~0.16 proportion yes to no
iters = 10 #train/test split iterations
splitPerc = 0.7 #percentage of data for training
# initialize a list to store results
results_list = list()
# time the loop
loop_start = Sys.time()
# loop through modelIdx combinations
for (modelIdx in modelIdx_list) {
# loop through the alphas
for (alpha in alphas) {
# loop through the thresholds
for (threshold in thresholds) {
# time the threshold loop
thresh_start = Sys.time()
# make holders for the stats from each iteration for this modelIdx, alpha, and threshold
iteration_stats = matrix(nrow = iters, ncol = 5)
colnames(iteration_stats) = c("Seed", "Threshold", "Alpha", "Sensitivity", "Specificity")
# counter for row index in iteration_stats
row_counter = 1
# external cross-validation loop
for (i in 1:iters) {
set.seed(i)
# generate the test train split
trainIdx = sample(1:nrow(cs2_conv), round(splitPerc * nrow(cs2_conv)))
AttritionTrn = cs2_conv[trainIdx, ]
AttritionTst = cs2_conv[-trainIdx, ]
# train Naive Bayes model with current alpha and feature (modelIdx) set
modelNB = naiveBayes(AttritionTrn[, modelIdx], AttritionTrn$Attrition, laplace = alpha)
# predict on testing set
preds = predict(modelNB, AttritionTst[, modelIdx], type = "raw") # raw returns the probabilities
attrition_probs = preds[, "Yes"]
predictions = ifelse(attrition_probs >= threshold, "Yes", "No")
# calculate confusion matrix
CM_no = confusionMatrix(table(predictions, AttritionTst$Attrition))
# get sensitivity and specificity for "No" class
sens_no = CM_no$byClass["Sensitivity"] # the reference (positive) class is "No"
spec_no = CM_no$byClass["Specificity"]
# store results for that iteration
iteration_stats[row_counter, ] = c(i, threshold, alpha, sens_no, spec_no)
row_counter = row_counter + 1
} # complete external cross-validation loop
# initialize variables to store predictions and actual values to calculate confusion matrix after all LOO
all_predictions = c()
all_actual = c()
# internal cross-validation loop (Leave-One-Out Cross-Validation)
for (row in 1:nrow(cs2_conv)) {
# generate the test train split
AttritionTrn = cs2_conv[-row, ]
AttritionTst = cs2_conv[row, ]
# train Naive Bayes model with current alpha and feature (modelIdx) set
modelNB = naiveBayes(AttritionTrn[, modelIdx], AttritionTrn$Attrition, laplace = alpha)
# predict on testing set
preds = predict(modelNB, AttritionTst[, modelIdx], type = "raw") # raw returns the probabilities
attrition_probs = unname(preds[, "Yes"]) # single test row: drop the name to get a bare probability
predictions = ifelse(attrition_probs >= threshold, "Yes", "No")
# accumulate predictions and actual values
all_predictions = c(all_predictions, predictions)
all_actual = c(all_actual, as.character(AttritionTst$Attrition)) # turn factors to characters to match all_actual
} # complete internal cross-validation loop
# calculate confusion matrix using all predictions and actual values
CM_no = confusionMatrix(table(all_predictions, all_actual))
# get sensitivity and specificity for "No" class
sens_icv = CM_no$byClass["Sensitivity"] # the reference (positive) class is "No"
spec_icv = CM_no$byClass["Specificity"]
# calculate mean sensitivity and specificity over ECV and ICV for this combination of modelIdx, alpha, and threshold
# not sure if this is the best way since there are many more test/train splits in ICV
# also don't know if people ever really average ECV and ICV results
avg_sens = mean(c(mean(iteration_stats[, "Sensitivity"]), sens_icv))
avg_spec = mean(c(mean(iteration_stats[, "Specificity"]), spec_icv))
# store results for this combination of modelIdx, alpha, and threshold
results_list = c(results_list, list(
data.frame(
ModelIdx = toString(modelIdx),
Alpha = alpha,
Threshold = threshold,
AvgSensitivity = avg_sens,
AvgSpecificity = avg_spec,
AvgSum = avg_sens + avg_spec
)
))
# time the threshold loop
thresh_end = Sys.time()
time_taken_thresh = difftime(thresh_end, thresh_start, units = "secs")
cat("Threshold Loop Time:", time_taken_thresh, "secs \n")
cat("ModelIdx:", toString(modelIdx), "- Alpha:", alpha, "- Threshold:", threshold, "\n")
} # complete threshold loop
} # complete alpha loop
} # complete modelIdx loop
Threshold Loop Time: 5.295719 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.3
Threshold Loop Time: 6.194655 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.35
Threshold Loop Time: 5.935995 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.4
Threshold Loop Time: 5.645946 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.45
Threshold Loop Time: 6.093287 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.5
Threshold Loop Time: 7.394172 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.3
Threshold Loop Time: 6.354893 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.35
Threshold Loop Time: 6.575595 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.4
Threshold Loop Time: 8.486059 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.45
Threshold Loop Time: 6.610982 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.5
Threshold Loop Time: 6.496804 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.3
Threshold Loop Time: 5.527282 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.35
Threshold Loop Time: 5.098834 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.4
Threshold Loop Time: 5.023465 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.45
Threshold Loop Time: 5.168213 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.5
Threshold Loop Time: 6.979724 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.3
Threshold Loop Time: 5.39098 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.35
Threshold Loop Time: 5.22696 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.4
Threshold Loop Time: 5.094164 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.45
Threshold Loop Time: 5.199527 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.5
Threshold Loop Time: 5.625987 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.3
Threshold Loop Time: 5.951124 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.35
Threshold Loop Time: 6.278088 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.4
Threshold Loop Time: 5.896067 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.45
Threshold Loop Time: 7.065845 secs
ModelIdx: 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.5
Threshold Loop Time: 7.367793 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.3
Threshold Loop Time: 10.19007 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.35
Threshold Loop Time: 7.363244 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.4
Threshold Loop Time: 7.143034 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.45
Threshold Loop Time: 7.580932 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.1 - Threshold: 0.5
Threshold Loop Time: 5.714779 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.3
Threshold Loop Time: 6.387567 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.35
Threshold Loop Time: 6.194161 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.4
Threshold Loop Time: 7.325557 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.45
Threshold Loop Time: 6.228531 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.3 - Threshold: 0.5
Threshold Loop Time: 6.639477 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.3
Threshold Loop Time: 6.100348 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.35
Threshold Loop Time: 6.430035 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.4
Threshold Loop Time: 6.722585 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.45
Threshold Loop Time: 6.185506 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.5 - Threshold: 0.5
Threshold Loop Time: 5.689047 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.3
Threshold Loop Time: 5.658677 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.35
Threshold Loop Time: 6.963918 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.4
Threshold Loop Time: 6.851558 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.45
Threshold Loop Time: 6.431898 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.7 - Threshold: 0.5
Threshold Loop Time: 6.668972 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.3
Threshold Loop Time: 6.705194 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.35
Threshold Loop Time: 7.238828 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.4
Threshold Loop Time: 6.780722 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.45
Threshold Loop Time: 6.329726 secs
ModelIdx: 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 - Alpha: 0.9 - Threshold: 0.5
# loop time
loop_end = Sys.time()
time_taken_loop = difftime(loop_end, loop_start, units = "mins")
cat("Total Loop Time:", time_taken_loop, "mins \n")
Total Loop Time: 5.326095 mins
# convert results_list to a data frame
results_df = do.call(rbind, results_list)
# output top 10 combinations for each ModelIdx
top_10_combos = results_df %>%
arrange(ModelIdx, desc(AvgSum), desc(AvgSpecificity), desc(AvgSensitivity)) %>%
group_by(ModelIdx) %>%
#top_n(10, wt = AvgSum)
#make a shorter version for demo purpose
top_n(5, wt = AvgSum)
kable(top_10_combos, caption = "demo: top 5 tuned parameter combinations per variable set, by sensitivity + specificity") %>% kable_styling(full_width = FALSE, position = "left")
| ModelIdx | Alpha | Threshold | AvgSensitivity | AvgSpecificity | AvgSum |
|---|---|---|---|---|---|
| 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.1 | 0.50 | 0.8390322 | 0.6387663 | 1.477798 |
| 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.5 | 0.50 | 0.8415109 | 0.6349447 | 1.476455 |
| 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.3 | 0.50 | 0.8403885 | 0.6349447 | 1.475333 |
| 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.9 | 0.50 | 0.8417456 | 0.6335161 | 1.475262 |
| 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.7 | 0.50 | 0.8401521 | 0.6349447 | 1.475097 |
| 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.7 | 0.45 | 0.8244917 | 0.6559893 | 1.480481 |
| 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.9 | 0.45 | 0.8242569 | 0.6559893 | 1.480246 |
| 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.5 | 0.45 | 0.8235836 | 0.6559893 | 1.479573 |
| 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.3 | 0.45 | 0.8228953 | 0.6559893 | 1.478884 |
| 4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.1 | 0.45 | 0.8210799 | 0.6547697 | 1.475850 |
Based on this parameter tuning, the highest sum of sensitivity
and specificity my Naive Bayes model achieves is 148.79 (sensitivity of
83.11% and specificity of 65.68%). The optimum was reached with
ModelIdx = c(4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36)
(the indexes of the columns in cs2_conv), alpha = 0.7, and a
classification threshold of 0.46 for “Yes”. The variables are
BusinessTravel, DistanceFromHome,
EnvironmentSatisfaction, JobInvolvement,
JobLevel, JobRole, MaritalStatus,
MonthlyIncome, OverTime,
StockOptionLevel, TotalWorkingYears,
WorkLifeBalance, YearsAtCompany,
YearsInCurrentRole, and YearsWithCurrManager. Now
I will train my model on these variables using the full data set and
use it to predict the labels on the competition set.
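That final fit can be sketched as follows. Here competition is a hypothetical name for the unlabeled competition data frame, assumed to have received the same factor conversions as cs2_conv:

```r
# sketch: refit Naive Bayes on the full data with the tuned parameters
finalIdx = c(4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36)
finalNB = naiveBayes(cs2_conv[, finalIdx], cs2_conv$Attrition, laplace = 0.7)
# classify with the tuned threshold of 0.46 instead of the default 0.5
comp_probs = predict(finalNB, competition[, finalIdx], type = "raw")[, "Yes"]
comp_preds = ifelse(comp_probs >= 0.46, "Yes", "No")
```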
I would like to compare a k-Nearest Neighbors (k-NN)
model to the Naive Bayes model. I want to see whether k-NN can work
with both numerical and categorical variables, or only numerical ones
(perhaps converting the numeric factors back to numbers). Another
option would be to fit a separate model to each level of the
categorical variables. I may also experiment with the explanatory
variables used, scale the variables, try ECV + ICV, tune k, tune the
threshold, or over- or under-sample. I start by transforming the
variables.
# starting with the original dataset `cs2` since the ordinal categorical variables are in an integer class
# there are a few other variables that I could transform into ranked like business travel
# not sure it is valid to do it for marital status or overtime
# make a copy of the original dataset
numerical_df = cs2
# check the data
# head(numerical_df$BusinessTravel)
# reorder factor levels
numerical_df$BusinessTravel = factor(numerical_df$BusinessTravel, levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"), labels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))
# rename the levels to numbers
levels(numerical_df$BusinessTravel) = c(1:3)
# convert factor to integer
numerical_df$BusinessTravel = as.integer(numerical_df$BusinessTravel)
# check the conversion
# head(numerical_df$BusinessTravel)
# repeat for marital status and overtime :/
# head(numerical_df$MaritalStatus)
numerical_df$MaritalStatus = factor(numerical_df$MaritalStatus, levels = c("Single", "Married", "Divorced"), labels = c("Single", "Married", "Divorced"))
levels(numerical_df$MaritalStatus) = c(1:3)
numerical_df$MaritalStatus = as.integer(numerical_df$MaritalStatus)
# head(numerical_df$MaritalStatus)
# head(numerical_df$OverTime)
numerical_df$OverTime = factor(numerical_df$OverTime, levels = c("Yes", "No"), labels = c("Yes", "No"))
levels(numerical_df$OverTime) = c(1:2)
numerical_df$OverTime = as.integer(numerical_df$OverTime)
# head(numerical_df$OverTime)
# scale the variables
# str(numerical_df)
# list of columns to exclude from scaling
exclude_columns = c("ID", "EmployeeCount", "EmployeeNumber", "StandardHours")
# scale the numeric variables and reassign values to same columns
numerical_df = numerical_df %>%
  mutate_at(vars(-any_of(exclude_columns)),
            .funs = function(x) if (is.numeric(x)) as.numeric(scale(x)) else x) # any_of() matches names exactly (the regex matched substrings); as.numeric() drops the matrix attributes scale() attaches
# str(numerical_df)
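As a quick sanity check (not part of the original pipeline), the scaled columns should now have a mean near 0 and a standard deviation near 1, while the excluded columns keep their original units:

```r
# scaled column: mean ~0, sd ~1
round(c(mean(numerical_df$MonthlyIncome), sd(numerical_df$MonthlyIncome)), 6)
# excluded column: untouched
range(numerical_df$ID)
```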
This is a modified version of the NB variable-optimization
code block above. I start with the variables that show evidence of
significant (p < 0.05) differences between the attrition groups. They
are either integers already, or their levels can reasonably be ordered
and transformed into integers. Variables whose levels cannot reasonably
be ordered, for example Department and JobRole, are excluded. This loop
finds all unique combinations of a specified number of these scaled,
numeric variables. It iterates through a number of train/test splits
and computes the mean sensitivity, mean specificity, and their sum for
each combination. It returns the top ten combinations for that number
of variables. Then each pass increases the number of variables used in
the model by one.
# this is a computationally heavy, lengthy process depending on iterations.
modelIdx_num = c(2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36) # sig numeric vars (16)
# Function to calculate sensitivity, specificity for all combinations of a given number of variables
calculate_combinations <- function(modelIdx, max_vars = 15, iters = 10, splitPerc = 0.7) {
# Initialize an empty list to store results
results_list <- list()
# Initialize a variable to keep track of the maximum specificity reached
# max_specificity_so_far <- 0
# Initialize a variable to store the maximum mean specificity from previous iterations
# max_mean_specificity_previous <- 0
# Loop over different number of variables
for (num_vars in seq(8, max_vars)) {
cat("Number of variables:", num_vars, "\n")
start_time <- Sys.time()
# Generate all combinations of variables
index_combinations <- combinat::combn(modelIdx, num_vars)
# Initialize an empty list to store results for this iteration
temp_results_list <- list()
# Loop through each combination of variables
for (i in 1:ncol(index_combinations)) {
index_set <- index_combinations[, i]
# Create holders for the sensitivity, specificity for each iteration
masterSens <- numeric(iters)
masterSpec <- numeric(iters)
# Iterate over the specified number of iterations
for (j in 1:iters) {
set.seed(j)
# Sample 70% for training
trainIdx <- sample(1:dim(numerical_df)[1], round(splitPerc * dim(numerical_df)[1]))
# Select the rows for training
AttritionTrn <- numerical_df[trainIdx, ]
# Select the remaining rows for testing
AttritionTst <- numerical_df[-trainIdx, ]
# Temporarily set the current indexes for modeling
temp_modelIdx <- index_set
# Fit and classify in one step with k-NN
classifications <- knn(AttritionTrn[, temp_modelIdx], AttritionTst[, temp_modelIdx], AttritionTrn$Attrition, prob = TRUE, k = 15)
# Confusion matrix
CM <- confusionMatrix(table(classifications, AttritionTst$Attrition))
# Store sensitivity, specificity
masterSens[j] <- CM$byClass["Sensitivity"]
masterSpec[j] <- CM$byClass["Specificity"]
}
# Average sensitivity, specificity
avg_sensitivity <- mean(masterSens)
avg_specificity <- mean(masterSpec)
# Store the results including the combination of variables
temp_results_list[[i]] <- list(variables = index_set, Sensitivity = avg_sensitivity, Specificity = avg_specificity)
}
end_time <- Sys.time()
time_taken <- end_time - start_time
cat("Time taken:", format(time_taken), "\n") # format() keeps the difftime units in the printout
# Store the results for this iteration
results_list[[num_vars]] <- temp_results_list
# Get the maximum mean specificity for this number of variables
max_mean_specificity_current <- max(sapply(temp_results_list, function(x) x$Specificity))
# Check if the maximum mean specificity for the current number of variables is less than the previous maximum
# if (max_mean_specificity_current < max_mean_specificity_previous) {
# cat("Maximum specificity reached for", num_vars, "variables is less than previous maximum. Stopping further iterations.\n")
# break
# }
# Update the maximum mean specificity from previous iterations
# max_mean_specificity_previous <- max_mean_specificity_current
}
return(results_list)
}
# Define the list of indexes to loop through
modelIdx <- c(2,4,7,12,15,16,18,19,20,24,29,30,32,33,34,36) #significant features
# Define the maximum number of variables
max_vars <- 15
# Call the function to calculate combinations
results <- calculate_combinations(modelIdx, max_vars)
# Determine the maximum number of rows needed: up to 10 combinations kept per number of variables (8 through max_vars)
max_combinations <- (max_vars - 7) * 10
# Preallocate the output data frame with maximum expected number of rows
output_df <- data.frame(Number_of_Variables = integer(max_combinations),
Variable_Indexes = character(max_combinations),
Mean_Sensitivity = numeric(max_combinations),
Mean_Specificity = numeric(max_combinations),
Sum_of_Sensitivity_and_Specificity = numeric(max_combinations),
stringsAsFactors = FALSE)
# Loop over different number of variables
for (num_vars in seq(8, max_vars)) {
cat("\nTop 10 combinations with the highest specificity for", num_vars, "variables:\n")
sorted_indices <- order(sapply(results[[num_vars]], function(x) x$Specificity), decreasing = TRUE)
for (i in 1:min(10, length(sorted_indices))) {
index <- sorted_indices[i]
# cat("Combination", i, ": Variables =", results[[num_vars]][[index]]$variables, ", Specificity =", results[[num_vars]][[index]]$Specificity, "\n")
# Assign the results to the preallocated rows in the output data frame
row_index <- (num_vars - 8) * 10 + i # Calculate the row index
output_df$Number_of_Variables[row_index] <- num_vars
output_df$Variable_Indexes[row_index] <- paste(results[[num_vars]][[index]]$variables, collapse = ", ")
output_df$Mean_Sensitivity[row_index] <- results[[num_vars]][[index]]$Sensitivity
output_df$Mean_Specificity[row_index] <- results[[num_vars]][[index]]$Specificity
output_df$Sum_of_Sensitivity_and_Specificity[row_index] <- results[[num_vars]][[index]]$Sensitivity + results[[num_vars]][[index]]$Specificity
}
}
# Print the final output data frame
print(output_df)
From a pool of these variables (Age,
BusinessTravel, DistanceFromHome,
EnvironmentSatisfaction, JobInvolvement,
JobLevel, JobSatisfaction,
MaritalStatus, MonthlyIncome,
OverTime, StockOptionLevel,
TotalWorkingYears, WorkLifeBalance,
YearsAtCompany, YearsInCurrentRole, and
YearsWithCurrManager), I try to predict which
combinations might yield the highest sensitivity and specificity.
Overall, the results don’t look very promising, perhaps because I
didn’t adjust the threshold or tune the parameter k. I choose the
generally best-performing set for each number of variables to use in
the next block of code, where I tune the parameters.
Using several of what seem likely to be the best combinations
of explanatory variables, I tune several parameters in the k-NN model.
I adjust and tune the classification threshold, because the dataset is
imbalanced, with only 16% “Yes” in the Attrition variable. I also tune
the number of neighbors, k. I evaluate all combinations of variables
and tuning parameters with both external and internal validation. There
is probably a better, clearer way to tune k.
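One cleaner alternative for the k tuning (a sketch of what could be done, not what I ran) is to let caret handle the k grid and the cross-validation in a single call:

```r
# sketch: tune k with caret's repeated cross-validation
# drop the matrix attributes scale() attaches so train() gets plain numeric columns
feats = as.data.frame(lapply(numerical_df[, modelIdx_num], as.numeric))
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
knnFit = train(x = feats, y = numerical_df$Attrition, method = "knn",
               trControl = ctrl, tuneGrid = data.frame(k = seq(1, 49, 2)))
knnFit$bestTune # the k with the best cross-validated accuracy
```

Note that caret selects k by accuracy by default; with this class imbalance, passing classProbs = TRUE and summaryFunction = twoClassSummary to trainControl (and a matching metric to train) would be closer to the sensitivity-plus-specificity objective, and even then it would not tune the classification threshold.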
# I comment out some parameters to make a demo version of this code, so that it doesn't take so long to run.
# define parameters
modelIdx_list = list(
c(2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36), # sig numeric vars (16)
# c(12,16,19,24,29,33,34,36), #8var
# c(16,18,20,24,29,33,34,36), #8var
# c(2,12,16,19,24,29,33,34,36), #9var
# c(2,12,16,19,24,29,30,33,34,36), #10var
# c(2,12,16,19,20,24,29,30,33,34,36), #11var
# c(2,7,16,18,19,20,24,29,30,33,34,36), #12var
# c(2,7,12,16,18,19,20,24,29,30,33,34,36), #13var
c(2,7,12,15,16,18,19,20,24,29,30,33,34,36), #14var
# c(2,7,12,15,16,18,19,20,24,29,30,32,33,34,36), #15var
# c(12,15,16,19,20,24,29,30,32,33,34,36), # NB Combination 2 (my hand-picked variables) minus job role
c(4,7,12,15,16,19,20,24,29,30,32,33,34,36) # NB Combination 4 (the *best*) minus job role
# c(2, 7, 20, 22, 25, 30, 31, 33, 34, 35, 36) # all numeric variables
)
# actual set of parameters used
# thresholds = seq(0.1, 0.6, 0.02) #classifier thresholds to try for "Yes" based on ~0.16 proportion yes to no
# numKs = seq(1, 49, 2) #odd Ks to try
# iters = 100 #train/test split iterations
# splitPerc = 0.7 #percentage of data for training
# demo set
thresholds = seq(0.1, 0.5, 0.05) #classifier thresholds to try for "Yes" based on ~0.16 proportion yes to no
numKs = seq(3, 49, 2) #odd Ks to try
iters = 10 #train/test split iterations
splitPerc = 0.7 #percentage of data for training
# for debugging
# modelIdx = c(2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36)
# threshold = 0.5
# k = 11
# i = 1
# initialize a list to store results
results_list = list()
# time the loop
loop_start = Sys.time()
cat("Start time:", format(loop_start), "\n") # format() so cat() prints a readable timestamp rather than the numeric epoch
# loop through modelIdx combinations
for (modelIdx in modelIdx_list) {
# loop through the thresholds
for (threshold in thresholds) {
# loop through the Ks
for (k in numKs) {
# time the K loop
k_start = Sys.time()
# make holders for the stats from each iteration for this modelIdx, threshold and K
iteration_stats = matrix(nrow = iters, ncol = 5)
colnames(iteration_stats) = c("Seed", "Threshold", "K", "Sensitivity", "Specificity")
# counter for row index in iteration_stats
row_counter = 1
# external cross-validation loop
for (i in 1:iters) {
set.seed(i)
# generate the test train split
trainIdx = sample(1:nrow(numerical_df), round(splitPerc * nrow(numerical_df)))
AttritionTrn = numerical_df[trainIdx, ]
AttritionTst = numerical_df[-trainIdx, ]
# train kNN model with current k and feature (modelIdx) set
classifications = knn(AttritionTrn[, modelIdx], AttritionTst[, modelIdx], AttritionTrn$Attrition, prob = T, k = k)
# create a table of predicted vs actual values for assessing classification accuracy
#table(classifications, AttritionTst$Attrition)
# inspect the classifications and their attributes when debugging
# classifications
# attributes(classifications)
# compute the probability of the "Yes" class: knn's prob attribute is the winning-class vote share, so flip it when the predicted label is "No"
probs = ifelse(classifications == "Yes", attributes(classifications)$prob, 1 - attributes(classifications)$prob)
# apply the new threshold to reclassify observations
NewClass = ifelse(probs >= threshold, "Yes", "No")
# this didn't work if all the predictions were No because the Yes row was missing
# tabulate the new classifications against actual values
# table(NewClass, AttritionTst$Attrition)
# calculate the confusion matrix
# don't need to relevel the confusion matrix because I am looking at sensitivity + specificity
# CM_no = confusionMatrix(table(relevel(as.factor(NewClass), ref = "No"), AttritionTst$Attrition), mode = "everything")
# create a table with rows for both "Yes" and "No" classes (in the event there are zero "Yes"s)
table_predicted_actual = matrix(0, nrow = 2, ncol = 2, dimnames = list(c("No", "Yes"), c("No", "Yes")))
# populate the counts based on the predictions and actual classes
table_predicted_actual["No", "No"] = sum(NewClass == "No" & AttritionTst$Attrition == "No")
table_predicted_actual["No", "Yes"] = sum(NewClass == "No" & AttritionTst$Attrition == "Yes")
table_predicted_actual["Yes", "No"] = sum(NewClass == "Yes" & AttritionTst$Attrition == "No")
table_predicted_actual["Yes", "Yes"] = sum(NewClass == "Yes" & AttritionTst$Attrition == "Yes")
# calculate the confusion matrix for this iteration
CM_no = confusionMatrix(table_predicted_actual, mode = "everything")
# get sensitivity and specificity for "No" class
sens_no = CM_no$byClass["Sensitivity"] # the reference (positive) class is "No"
spec_no = CM_no$byClass["Specificity"]
# store results for that iteration
iteration_stats[row_counter, ] = c(i, threshold, k, sens_no, spec_no)
row_counter = row_counter + 1
} # complete external cross-validation loop
# initialize variables to store predictions and actual values to calculate confusion matrix after all LOO
all_predictions = c()
all_actual = c()
# internal cross-validation (Leave-One-Out Cross-Validation)
classifications = knn.cv(numerical_df[, modelIdx], numerical_df$Attrition,
prob = TRUE, k = k)
# create a table of predicted vs actual values for assessing classification accuracy
#table(classifications, AttritionTst$Attrition)
# display the classifications and their attributes; useful for debugging and understanding model output
classifications
attributes(classifications)
# compute probabilities specifically for the "Yes" class, adjusting based on predicted labels
probs = ifelse(classifications == "Yes", attributes(classifications)$prob, 1 - attributes(classifications)$prob)
# apply the new threshold to reclassify observations
NewClass = ifelse(probs >= threshold, "Yes", "No")
# tabulate the new classifications against actual values
# table(NewClass, AttritionTst$Attrition)
# calculate the confusion matrix
# don't need to relevel the confusion matrix because I am looking at sensitivity + specificity
# CM_no = confusionMatrix(table(relevel(as.factor(NewClass), ref = "No"), cs2_conv$Attrition), mode = "everything")
# create a table with rows for both "Yes" and "No" classes (in the event there are zero "Yes"s)
table_predicted_actual = matrix(0, nrow = 2, ncol = 2, dimnames = list(c("No", "Yes"), c("No", "Yes")))
# populate the counts based on the predictions and actual classes
table_predicted_actual["No", "No"] = sum(NewClass == "No" & cs2_conv$Attrition == "No")
table_predicted_actual["No", "Yes"] = sum(NewClass == "No" & cs2_conv$Attrition == "Yes")
table_predicted_actual["Yes", "No"] = sum(NewClass == "Yes" & cs2_conv$Attrition == "No")
table_predicted_actual["Yes", "Yes"] = sum(NewClass == "Yes" & cs2_conv$Attrition == "Yes")
# calculate the confusion matrix for this iteration
CM_no = confusionMatrix(table_predicted_actual, mode = "everything")
# get sensitivity and specificity for "No" class
sens_icv = CM_no$byClass["Sensitivity"] # reference (positive) class is "No"
spec_icv = CM_no$byClass["Specificity"]
# calculate mean sensitivity and specificity over ECV and ICV for this combination of modelIdx, threshold and k
# note: averaging the ECV and ICV estimates is a judgment call
avg_sens = mean(c(mean(iteration_stats[, "Sensitivity"]), sens_icv))
avg_spec = mean(c(mean(iteration_stats[, "Specificity"]), spec_icv))
# store results for this combination of modelIdx, threshold and K
results_list = c(results_list, list(
data.frame(
ModelIdx = toString(modelIdx),
Threshold = threshold,
K = k,
AvgSensitivity = avg_sens,
AvgSpecificity = avg_spec,
AvgSum = avg_sens + avg_spec
)
))
# time the k loop
k_end = Sys.time()
time_taken_thresh = difftime(k_end, k_start, units = "secs")
# cat("K Loop Time:", time_taken_thresh, "secs \n")
# cat("ModelIdx:", toString(modelIdx), "- Threshold:", threshold, "- K:", k, "\n")
} # complete k loop
} # complete threshold loop
# print progress report
progress_time = Sys.time()
progress = difftime(progress_time, loop_start, units = "mins")
cat("Total Elapsed Time:", progress, "mins \n")
} # complete modelIdx loop
Total Elapsed Time: 0.4169225 mins
Total Elapsed Time: 0.7708993 mins
Total Elapsed Time: 1.122256 mins
# loop time
# loop_end = Sys.time()
# time_taken_loop = difftime(loop_end, loop_start, units = "mins")
# cat("Total Loop Time:", time_taken_loop, "mins \n")
# convert results_list to a data frame
results_df = do.call(rbind, results_list)
# plot the optimal k (though not particularly helpful or needed with the other tuning)
# aggregate results by K
results_agg = results_df %>%
group_by(K) %>%
dplyr::summarize(Mean_Sensitivity = mean(AvgSensitivity),
Mean_Specificity = mean(AvgSpecificity))
# plot the average sensitivity and specificity for each K
ggplot(results_agg, aes(x = K)) +
geom_line(aes(y = Mean_Sensitivity, color = "Mean Sensitivity")) +
geom_line(aes(y = Mean_Specificity, color = "Mean Specificity")) +
labs(title = "Mean sensitivity and specificity for different values of K",
x = "K parameter", y = "Metric", color = "Metric") +
scale_color_manual(values = c("Mean Sensitivity" = "darkgreen", "Mean Specificity" = "darkblue")) +
theme_bw()
# output top 10 combinations for each ModelIdx
top_10_combos = results_df %>%
arrange(ModelIdx, desc(AvgSum), desc(AvgSpecificity), desc(AvgSensitivity)) %>%
group_by(ModelIdx) %>%
top_n(10, wt = AvgSum)
kable(top_10_combos, caption = "demo: top 10 k-NN parameter combinations by sensitivity + specificity") %>% kable_styling(full_width = FALSE, position = "left")
| ModelIdx | Threshold | K | AvgSensitivity | AvgSpecificity | AvgSum |
|---|---|---|---|---|---|
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 41 | 0.8368230 | 0.6553327 | 1.492156 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 43 | 0.8195869 | 0.6709875 | 1.490574 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 39 | 0.7915156 | 0.6956844 | 1.487200 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 49 | 0.8248028 | 0.6613105 | 1.486113 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 45 | 0.8019084 | 0.6836708 | 1.485579 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 47 | 0.8388773 | 0.6412330 | 1.480110 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 37 | 0.8068725 | 0.6692766 | 1.476149 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 33 | 0.7747236 | 0.6965216 | 1.471245 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 35 | 0.8256930 | 0.6421069 | 1.467800 |
| 2, 4, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.10 | 43 | 0.6672257 | 0.7987358 | 1.465962 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 41 | 0.8100142 | 0.6421706 | 1.452185 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 43 | 0.7889211 | 0.6613497 | 1.450271 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 29 | 0.7937247 | 0.6544004 | 1.448125 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 27 | 0.8173574 | 0.6294154 | 1.446773 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 47 | 0.8091314 | 0.6371005 | 1.446232 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.20 | 49 | 0.8734955 | 0.5707366 | 1.444232 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.15 | 31 | 0.7705838 | 0.6735651 | 1.444149 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.20 | 43 | 0.8764539 | 0.5676696 | 1.444123 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.20 | 39 | 0.8709846 | 0.5730479 | 1.444033 |
| 2, 7, 12, 15, 16, 18, 19, 20, 24, 29, 30, 33, 34, 36 | 0.20 | 33 | 0.8676089 | 0.5762523 | 1.443861 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 33 | 0.7583289 | 0.7413052 | 1.499634 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 49 | 0.8057511 | 0.6937896 | 1.499541 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 39 | 0.7694411 | 0.7289421 | 1.498383 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 47 | 0.8191319 | 0.6781339 | 1.497266 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 45 | 0.7807184 | 0.7157678 | 1.496486 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 43 | 0.7962160 | 0.7000963 | 1.496312 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 37 | 0.7892095 | 0.7068075 | 1.496017 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 31 | 0.7746572 | 0.7148029 | 1.489460 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 35 | 0.8105285 | 0.6769495 | 1.487478 |
| 4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36 | 0.15 | 41 | 0.8100356 | 0.6700381 | 1.480074 |
Based on this parameter tuning, the highest sum of sensitivity
and specificity my k-NN model achieves is 148.78 (sensitivity of
75.16% and specificity of 73.62%). This optimum uses
ModelIdx = c(4, 7, 12, 15, 16, 19, 20, 24, 29, 30, 32, 33, 34, 36),
the indexes of the columns in cs2 (and
numerical_df), with k = 41 and a
"Yes" probability threshold of 0.14. This is the same list of
variables that achieves the best performance (148.79) in the Naive Bayes
model, minus JobRole. Sensitivity and specificity are more
balanced in this model, probably because the threshold here is much
lower: 0.14 versus 0.46 in the Naive Bayes model. I will use the
Naive Bayes model for the final predictions: its sum of sensitivity
and specificity is slightly higher (by only 0.01), and it avoids what
seems like a very low threshold of 0.14, given that the percentage of
Attrition in the actual data set is ~16%. I also prefer the
simpler model; while the Naive Bayes model does use one more variable, I
think the work of pre-processing variables for k-NN offsets the extra
variable.
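The threshold adjustment both classifiers rely on can be isolated into a small helper. This is a hedged sketch, not code from the analysis above: `threshold_metrics` is an illustrative name, and it simply rebuilds the 2x2 table with both predicted classes present (the same guard used in the loops above) and returns sensitivity and specificity for the "No" class.

```r
# Hedged sketch: reclassify by a probability threshold and score it.
# `threshold_metrics` is an illustrative name, not part of the analysis above.
threshold_metrics = function(probs_yes, actual, threshold) {
  pred = ifelse(probs_yes >= threshold, "Yes", "No")
  # build the full 2x2 table even if one predicted class is absent
  tab = table(factor(pred, levels = c("No", "Yes")),
              factor(actual, levels = c("No", "Yes")))
  sens_no = tab["No", "No"] / sum(tab[, "No"])    # sensitivity when "No" is the positive class
  spec_no = tab["Yes", "Yes"] / sum(tab[, "Yes"]) # specificity when "No" is the positive class
  c(Sensitivity = sens_no, Specificity = spec_no)
}

# example: four observations scored at a 0.5 cutoff
threshold_metrics(c(0.10, 0.20, 0.60, 0.05),
                  c("No", "Yes", "Yes", "No"),
                  threshold = 0.5)
```

Lowering the threshold trades specificity for sensitivity on the "No" class, which is why the tuned cutoffs above sit well below 0.5.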
This R Shiny app displays the distributions of monthly
income by attrition group in a boxplot and histograms. A user can select
individual job roles or combinations of them, as well as a salary range.
The app updates interactively with the distributions for the user's
selections and displays the percent of selected employees in each
attrition group.
R Shiny Employee Attrition and Income App is here: https://kdhenderson.shinyapps.io/Employee_Attrition_and_Income/.
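As a rough illustration of the app's server-side logic, the filtering and percentage summary can be written as a plain function. This is a hedged sketch: `attrition_pct` and its arguments are illustrative names, and it assumes a data frame shaped like cs2 (with JobRole, MonthlyIncome, and Attrition columns).

```r
# Hedged sketch of the app's server-side summary (illustrative names).
# Given selected job roles and a salary range, return the percent of
# matching employees in each attrition group.
attrition_pct = function(df, roles, income_range) {
  sel = df[df$JobRole %in% roles &
           df$MonthlyIncome >= income_range[1] &
           df$MonthlyIncome <= income_range[2], ]
  round(100 * prop.table(table(sel$Attrition)), 1)
}

# e.g. attrition_pct(cs2, c("Sales Executive"), c(2000, 8000))
```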
I import the unlabeled competition data set from the
cloud as competition. I convert the numerical ranked
variables to factors. So that my column indexes are the same as those in
the primary data set, I add an Attrition column. I train a
Naive Bayes model on the entire labeled primary data set using the tuned
alpha parameter. Then I adjust the classifications with the tuned
threshold.
# make predictions with Naive Bayes model
# variables used: `BusinessTravel`, `DistanceFromHome`, `EnvironmentSatisfaction`, `JobInvolvement`, `JobLevel`, `JobRole`, `MaritalStatus`, `MonthlyIncome`, `OverTime`, `StockOptionLevel`, `TotalWorkingYears`, `WorkLifeBalance`, `YearsAtCompany`, `YearsInCurrentRole`, `YearsWithCurrManager`
# parameters
# train_dataframe = cs2_conv
# the competition set lacks the Attrition column (#3), so either adjust the index numbers or add the column (done below)
modelIdx = c(4, 7, 12, 15, 16, 17, 19, 20, 24, 29, 30, 32, 33, 34, 36)
alpha = 0.7
threshold = 0.46 # "Yes" >=
# import data: CaseStudy2CompSet No Attrition.csv from AWS S3 msdsds6306 bucket
url = "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Attrition.csv"
# encoded_url = URLencode(url) # need to encode the spaces as %20 (or do it manually)
competition = read.table(textConnection(getURL(url)), sep = ",", header = TRUE, stringsAsFactors = TRUE)
# save a copy of the competition unlabeled data
# write.csv(competition, file = "data/CaseStudy2CompSet No Attrition.csv", row.names = FALSE)
# get a sense of the data
str(competition)
'data.frame': 300 obs. of 35 variables:
$ ID : int 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 ...
$ Age : int 35 33 26 55 29 51 52 39 31 31 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 3 2 1 3 3 2 ...
$ DailyRate : int 750 147 1330 1311 1246 1456 585 1387 1062 534 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 2 1 2 2 3 2 3 2 2 2 ...
$ DistanceFromHome : int 28 2 21 2 19 1 29 10 24 20 ...
$ Education : int 3 3 3 3 3 4 4 5 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 1 4 2 2 4 2 4 4 2 ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 1596 1207 1107 505 1497 145 2019 1618 1252 587 ...
$ EnvironmentSatisfaction : int 2 2 1 3 3 1 1 2 3 1 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 2 2 1 2 ...
$ HourlyRate : int 46 99 37 97 77 30 40 76 96 66 ...
$ JobInvolvement : int 4 3 3 3 2 2 3 3 2 3 ...
$ JobLevel : int 2 1 1 4 2 3 1 2 2 3 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 3 2 3 4 8 1 9 5 1 1 ...
$ JobSatisfaction : int 3 3 3 4 3 1 4 1 1 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 2 2 1 3 1 3 1 2 3 2 ...
$ MonthlyIncome : int 3407 3600 2377 16659 8620 7484 3482 5377 6812 9824 ...
$ MonthlyRate : int 25348 8429 19373 23258 23757 25796 19788 3835 17198 22908 ...
$ NumCompaniesWorked : int 1 1 1 2 1 3 2 2 1 3 ...
$ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
$ PercentSalaryHike : int 17 13 20 13 14 20 15 13 19 12 ...
$ PerformanceRating : int 3 3 4 3 3 4 3 3 3 3 ...
$ RelationshipSatisfaction: int 4 4 3 3 3 3 2 4 2 1 ...
$ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : int 2 1 1 0 2 0 2 3 0 0 ...
$ TotalWorkingYears : int 10 5 1 30 10 23 16 10 10 12 ...
$ TrainingTimesLastYear : int 3 2 0 2 3 1 3 3 2 2 ...
$ WorkLifeBalance : int 2 3 2 3 3 2 2 3 3 3 ...
$ YearsAtCompany : int 10 5 1 5 10 13 9 7 10 1 ...
$ YearsInCurrentRole : int 9 4 1 4 7 12 8 7 9 0 ...
$ YearsSinceLastPromotion : int 6 1 0 1 0 12 0 7 1 0 ...
$ YearsWithCurrManager : int 8 4 0 2 4 8 0 7 8 0 ...
summary(competition)
ID Age BusinessTravel DailyRate
Min. :1171 Min. :19.00 Non-Travel : 32 Min. : 102.0
1st Qu.:1246 1st Qu.:31.00 Travel_Frequently: 57 1st Qu.: 448.0
Median :1320 Median :36.00 Travel_Rarely :211 Median : 775.0
Mean :1320 Mean :37.86 Mean : 784.8
3rd Qu.:1395 3rd Qu.:44.00 3rd Qu.:1117.0
Max. :1470 Max. :60.00 Max. :1490.0
Department DistanceFromHome Education
Human Resources : 11 Min. : 1.00 Min. :1.000
Research & Development:209 1st Qu.: 2.00 1st Qu.:2.000
Sales : 80 Median : 7.00 Median :3.000
Mean : 9.26 Mean :2.973
3rd Qu.:14.00 3rd Qu.:4.000
Max. :29.00 Max. :5.000
EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
Human Resources : 7 Min. :1 Min. : 2.0 Min. :1.000
Life Sciences :130 1st Qu.:1 1st Qu.: 508.8 1st Qu.:2.000
Marketing : 27 Median :1 Median : 994.5 Median :3.000
Medical : 94 Mean :1 Mean :1020.9 Mean :2.733
Other : 12 3rd Qu.:1 3rd Qu.:1542.5 3rd Qu.:4.000
Technical Degree: 30 Max. :1 Max. :2065.0 Max. :4.000
Gender HourlyRate JobInvolvement JobLevel
Female:105 Min. : 30.00 Min. :1.000 Min. :1.0
Male :195 1st Qu.: 50.00 1st Qu.:2.000 1st Qu.:1.0
Median : 66.00 Median :3.000 Median :2.0
Mean : 66.07 Mean :2.743 Mean :2.2
3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.0
Max. :100.00 Max. :4.000 Max. :5.0
JobRole JobSatisfaction MaritalStatus MonthlyIncome
Research Scientist :61 Min. :1.000 Divorced: 65 Min. : 1232
Sales Executive :57 1st Qu.:2.000 Married :128 1st Qu.: 3034
Laboratory Technician :55 Median :3.000 Single :107 Median : 5208
Manufacturing Director :31 Mean :2.767 Mean : 7103
Manager :30 3rd Qu.:4.000 3rd Qu.: 9750
Healthcare Representative:29 Max. :4.000 Max. :19973
(Other) :37
MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
Min. : 2097 Min. :0.000 Y:300 No :212 Min. :11.00
1st Qu.: 8420 1st Qu.:1.000 Yes: 88 1st Qu.:12.00
Median :15091 Median :2.000 Median :14.00
Mean :14499 Mean :2.547 Mean :15.17
3rd Qu.:20330 3rd Qu.:4.000 3rd Qu.:18.00
Max. :26914 Max. :9.000 Max. :25.00
PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
Median :3.000 Median :3.000 Median :80 Median :1.0000
Mean :3.153 Mean :2.803 Mean :80 Mean :0.7833
3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
Median :10.00 Median :2.000 Median :3.000 Median : 5.000
Mean :12.44 Mean :2.683 Mean :2.747 Mean : 7.527
3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000
Max. :38.00 Max. :6.000 Max. :4.000 Max. :37.000
YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 2.00
Median : 3.00 Median : 1.00 Median : 3.00
Mean : 4.33 Mean : 2.29 Mean : 4.38
3rd Qu.: 7.00 3rd Qu.: 3.00 3rd Qu.: 7.00
Max. :18.00 Max. :15.00 Max. :17.00
# convert numerical ranked variables to factors
numVars_to_fact = c("Education", "EnvironmentSatisfaction",
"JobInvolvement", "JobLevel", "JobSatisfaction",
"PerformanceRating", "RelationshipSatisfaction",
"StockOptionLevel", "WorkLifeBalance")
competition = competition %>%
mutate(across(all_of(numVars_to_fact), as.factor))
# check the conversion
# str(competition)
# add the Attrition column between Age and BusinessTravel
competition = competition %>%
mutate(Attrition = NA, .after = "Age")
# str(competition)
# train the Naive Bayes model
fullsetNBmodel = naiveBayes(cs2_conv[, modelIdx], cs2_conv$Attrition, laplace = alpha)
# predict the labels on the competition set
preds = predict(fullsetNBmodel, competition[, modelIdx], type = "raw") # raw returns the probabilities
# get the Yes probabilities to adjust threshold
attrition_probs = preds[, "Yes"]
# classify with new threshold
predictions = ifelse(attrition_probs >= threshold, "Yes", "No")
# add the labels to the Attrition column
competition = competition %>% mutate(Attrition = factor(predictions))
# str(competition)
# get the proportion of attrition out of curiosity
summary(competition$Attrition)
No Yes
242 58
NoAttProb = summary(competition$Attrition)["No"] / sum(summary(competition$Attrition))
NoAttProb
No
0.8066667
AttProb = summary(competition$Attrition)["Yes"] / sum(summary(competition$Attrition))
AttProb
Yes
0.1933333
# make dataframe with just competition set ordered IDs and labels
competitionLabels = competition %>% select(ID, Attrition)
str(competitionLabels)
'data.frame': 300 obs. of 2 variables:
$ ID : int 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 ...
$ Attrition: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
# save to file
# write.csv(competitionLabels, file = "data/Case2PredictionsHenderson Attrition.csv", row.names = FALSE)
(with < $3,000 RMSE for both the training and validation sets)
To create a new dataframe cs2_regres, I
transform the original data set by reordering factor levels and
converting the factor variables to integers. Based on the EDA scatter
plots, I also derive some quadratic variables.
# starting with the original dataset `cs2` since the ordinal categorical variables are in an integer class
# there are a few other variables that I could transform into ranked variables, like business travel
# not sure it is valid to do it for marital status or overtime
# make a copy of the original dataset
numerical_df = cs2
# check the data
head(numerical_df$BusinessTravel)
[1] Travel_Rarely Travel_Rarely Travel_Frequently Travel_Rarely
[5] Travel_Frequently Travel_Frequently
Levels: Non-Travel Travel_Frequently Travel_Rarely
# reorder factor levels
numerical_df$BusinessTravel = factor(numerical_df$BusinessTravel, levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"), labels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))
# rename the levels to numbers
levels(numerical_df$BusinessTravel) = c(1:3)
# convert factor to integer
numerical_df$BusinessTravel = as.integer(numerical_df$BusinessTravel)
# check the conversion
head(numerical_df$BusinessTravel)
[1] 2 2 3 2 3 3
# repeat for marital status, overtime, and attrition
head(numerical_df$MaritalStatus)
[1] Divorced Single Single Married Single Divorced
Levels: Divorced Married Single
numerical_df$MaritalStatus = factor(numerical_df$MaritalStatus, levels = c("Single", "Married", "Divorced"), labels = c("Single", "Married", "Divorced"))
levels(numerical_df$MaritalStatus) = c(1:3)
numerical_df$MaritalStatus = as.integer(numerical_df$MaritalStatus)
head(numerical_df$MaritalStatus)
[1] 3 1 1 2 1 3
head(numerical_df$OverTime)
[1] No No No No Yes No
Levels: No Yes
numerical_df$OverTime = factor(numerical_df$OverTime, levels = c("Yes", "No"), labels = c("Yes", "No"))
levels(numerical_df$OverTime) = c(1:2)
numerical_df$OverTime = as.integer(numerical_df$OverTime)
head(numerical_df$OverTime)
[1] 2 2 2 2 1 2
head(numerical_df$Attrition)
[1] No No No No No No
Levels: No Yes
numerical_df$Attrition = factor(numerical_df$Attrition, levels = c("Yes", "No"), labels = c("Yes", "No"))
levels(numerical_df$Attrition) = c(1:2)
numerical_df$Attrition = as.integer(numerical_df$Attrition)
head(numerical_df$Attrition)
[1] 2 2 2 2 2 2
# make a new DF, transform x variable (squaring term) and add it as a new column
cs2_regres = numerical_df %>%
mutate(
YearsInCurrentRole2 = YearsInCurrentRole^2,
YearsWithCurrManager2 = YearsWithCurrManager^2)
#indexes from cs2_regres dataframe
ModelIdx = c(2,3,4,5,7,8,12,14,15,16,18,19,21,22,24,25,26,27,29,30,31,32,33,34,35,36,37,38)
To find the variables that produce a model with the
lowest RMSE, I start with an empty model formula. In a loop, I add each
of the variables in the cs2_regres dataframe independently
to build a regression model. I evaluate each model with leave-one-out
cross validation. I store each result in a dataframe, sorted by lowest
RMSE. Using these values as a guide, I manually choose the variable
which performs best. I add it as an explanatory variable to the model
and repeat the process with the remaining variables until I can no
longer achieve a lower RMSE with additional variables.
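The greedy search described above can also be automated. This is a hedged sketch (the function name and structure are mine, not the code used below): it wraps the add-one-variable LOOCV step in a loop that stops when no remaining candidate lowers the RMSE.

```r
# Hedged sketch: automated forward selection with LOOCV (illustrative name).
forward_select_loocv = function(df, response, candidates) {
  chosen = character(0)
  best_rmse = Inf
  repeat {
    remaining = setdiff(candidates, chosen)
    if (length(remaining) == 0) break
    # LOOCV RMSE for the current model plus each remaining candidate
    rmse_by_var = sapply(remaining, function(v) {
      f = reformulate(c(chosen, v), response = response)
      sq_err = sapply(seq_len(nrow(df)), function(j) {
        fit = lm(f, data = df[-j, ])
        (df[[response]][j] - predict(fit, newdata = df[j, ]))^2
      })
      sqrt(mean(sq_err))
    })
    # stop when the best addition no longer improves the RMSE
    if (min(rmse_by_var) >= best_rmse) break
    best_rmse = min(rmse_by_var)
    chosen = c(chosen, names(which.min(rmse_by_var)))
  }
  list(variables = chosen, rmse = best_rmse)
}

# e.g. forward_select_loocv(cs2_regres, "MonthlyIncome", names(cs2_regres)[ModelIdx])
```

The manual version below does the same thing one step at a time, which makes it easier to inspect the per-variable RMSE table at each stage.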
# unused indexes from cs2_regres dataframe
ModelIdx = c(2,3,4,5,8,14,18,19,21,22,24,25,26,27,29,31,32,33,34,35,37)
# initialize an empty dataframe to store results
results = data.frame(Index = numeric(length(ModelIdx)), Variable = character(length(ModelIdx)), RMSE = numeric(length(ModelIdx)))
# get number of rows in data frame
n = nrow(cs2_regres)
# loop through each explanatory variable separately
for (i in seq_along(ModelIdx)) {
# create formula with the current explanatory variable
model_formula = as.formula(paste("MonthlyIncome ~ JobLevel + TotalWorkingYears + YearsWithCurrManager + YearsWithCurrManager2 + DistanceFromHome + JobInvolvement + EnvironmentSatisfaction + ", names(cs2_regres)[ModelIdx[i]]))
# indexes of variables used in model c(16,30,36,7,15,38,12)
# initialize vector to store squared error from each leave-one-out loop
cv_error_model = numeric(n)
# loop through setting each row as test set for internal n-fold CV
for (j in 1:n) {
# leave out the j-th observation
test_row = cs2_regres[j, ]
# use the remaining data for training
train_data = cs2_regres[-j, ]
# fit the model with train_data
model = lm(model_formula, data = train_data)
# predict on the left-out observation
pred = predict(model, newdata = test_row)
# calculate the squared error
cv_error_model[j] = (test_row$MonthlyIncome - pred)^2
}
# calculate the root mean squared error (RMSE) for the current model
RMSE = sqrt(mean(cv_error_model))
# store the variable index, name, and RMSE in the results dataframe
results[i, "Index"] = ModelIdx[i]
results[i, "Variable"] = names(cs2_regres)[ModelIdx[i]]
results[i, "RMSE"] = RMSE
}
# sort results by RMSE in ascending order
results = results[order(results$RMSE), ]
# print the sorted table
kable(results, caption = "demo of resulting RMSE for each single added variable") %>% kable_styling(full_width = FALSE, position = "left")
|   | Index | Variable | RMSE |
|---|---|---|---|
| 12 | 25 | PercentSalaryHike | 1376.463 |
| 9 | 21 | MonthlyRate | 1376.495 |
| 1 | 2 | Age | 1376.499 |
| 16 | 31 | TrainingTimesLastYear | 1376.594 |
| 13 | 26 | PerformanceRating | 1376.663 |
| 6 | 14 | HourlyRate | 1376.674 |
| 17 | 32 | WorkLifeBalance | 1376.718 |
| 3 | 4 | BusinessTravel | 1376.755 |
| 2 | 3 | Attrition | 1376.770 |
| 10 | 22 | NumCompaniesWorked | 1376.825 |
| 8 | 19 | MaritalStatus | 1376.835 |
| 15 | 29 | StockOptionLevel | 1376.843 |
| 11 | 24 | OverTime | 1376.953 |
| 14 | 27 | RelationshipSatisfaction | 1376.964 |
| 7 | 18 | JobSatisfaction | 1376.974 |
| 5 | 8 | Education | 1377.005 |
| 4 | 5 | DailyRate | 1377.028 |
| 20 | 35 | YearsSinceLastPromotion | 1377.295 |
| 19 | 34 | YearsInCurrentRole | 1377.296 |
| 21 | 37 | YearsInCurrentRole2 | 1377.509 |
| 18 | 33 | YearsAtCompany | 1377.707 |
Measured with internal n-fold cross-validation, I achieve the
lowest RMSE of 1375.36 with the following explanatory variables:
JobLevel, TotalWorkingYears,
YearsWithCurrManager, YearsWithCurrManager2,
DistanceFromHome, JobInvolvement, and
EnvironmentSatisfaction.
To measure separate RMSEs for a training and a
validation set, I use external cross validation for a linear regression
model of MonthlyIncome using the variables above.
# model formula from variable indexes (16,30,36,7,15,38,12)
model_formula = as.formula("MonthlyIncome ~ JobLevel + TotalWorkingYears + YearsWithCurrManager + YearsWithCurrManager2 + DistanceFromHome + JobInvolvement + EnvironmentSatisfaction")
# number of train/test splits
iters = 100
# split percentage
split_perc = 0.7
# initialize vector to store squared error from each train/test split loop for training and validation sets
cv_error_train = numeric(iters)
cv_error_test = numeric(iters)
# external cross-validation loop
for (i in 1:iters) {
set.seed(i)
# generate the test train split
trainIdx = sample(1:nrow(cs2_regres), round(split_perc * nrow(cs2_regres)))
IncomeTrn = cs2_regres[trainIdx, ]
IncomeTst = cs2_regres[-trainIdx, ]
# fit the model with train_data
model = lm(model_formula, data = IncomeTrn)
# predict on the training set (same data used to make model)
predTrn = predict(model, newdata = IncomeTrn)
# calculate the mean squared error of training pred
cv_error_train[i] = mean((IncomeTrn$MonthlyIncome - predTrn)^2)
# predict on the validation set (data held out from the model)
predTst = predict(model, newdata = IncomeTst)
# calculate the mean squared error of validation pred
cv_error_test[i] = mean((IncomeTst$MonthlyIncome - predTst)^2)
}
# calculate the root mean squared error (RMSE) for the training set
RMSE_trn = sqrt(mean(cv_error_train)) # root of avg of error from each iteration
# calculate the root mean squared error (RMSE) for the validation set
RMSE_tst = sqrt(mean(cv_error_test))
# print the results
cat("\nRMSE of training set: ", RMSE_trn, "\n")
RMSE of training set: 1357.371
cat("\nRMSE of validation set: ", RMSE_tst, "\n")
RMSE of validation set: 1381.737
For my linear regression model, the RMSEs of the training and
validation sets are 1357.37 and 1381.74, respectively. To achieve this,
I used the variables JobLevel,
TotalWorkingYears, YearsWithCurrManager,
YearsWithCurrManager2, DistanceFromHome,
JobInvolvement, and EnvironmentSatisfaction. This
list includes YearsWithCurrManager2, a variable I
derived by squaring YearsWithCurrManager.
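Rather than materializing the squared column, a quadratic term can also be written directly into the formula with I(). A hedged sketch on toy data (the variable names and coefficients here are illustrative, not from the analysis):

```r
# The squared term can live in the formula itself via I(),
# avoiding a separate derived column (illustrative toy data).
set.seed(42)
yrs = 0:20
income = 3000 + 450 * yrs - 12 * yrs^2 + rnorm(21, sd = 20)
fit = lm(income ~ yrs + I(yrs^2))
coef(fit) # estimates should land near 3000, 450, and -12
```

A pre-computed column like YearsWithCurrManager2 does have one advantage here: it keeps the column indexes usable for the index-based model selection above.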
I import the unlabeled competition data set from the
cloud as competition. I create two new variables by
squaring existing variables. So that my column indexes are the same as
those in the primary data set, I add a MonthlyIncome
column. I train a linear regression model using the optimized variables
and all the observations in the labeled primary data set.
# make predictions with linear regression model
# variables used: `JobLevel`, `TotalWorkingYears`, `YearsWithCurrManager`, `YearsWithCurrManager2`, `DistanceFromHome`, `JobInvolvement`, `EnvironmentSatisfaction`
# parameters
# train_dataframe = cs2_regres
# adjust index numbers based on missing MonthlyIncome column or add column to competition set
# regresIdx = c(16,30,36,7,15,38,12)
# import data: CaseStudy2CompSet No Salary.csv from AWS S3 msdsds6306 bucket
# with read.table/getURL the first column header imports as `ï..ID` instead of `ID` (a UTF-8 BOM issue)
# url = "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv"
# competition = read.table(textConnection(getURL(url)), sep = ",", header = TRUE, stringsAsFactors = TRUE)
# so I use read.csv and specify encoding = "UTF-8", and it imports correctly
url = "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv"
competition = read.csv(url, header = TRUE, stringsAsFactors = TRUE, encoding = "UTF-8")
# save a copy of the competition unlabeled data
# write.csv(competition, file = "data/CaseStudy2CompSet No Salary.csv", row.names = FALSE)
# get a sense of the data
# str(competition)
# summary(competition)
# create squared variables, both to keep the indexes consistent
competition = competition %>%
mutate(
YearsInCurrentRole2 = YearsInCurrentRole^2,
YearsWithCurrManager2 = YearsWithCurrManager^2)
# check the conversion
# str(competition)
# add the MonthlyIncome column between MaritalStatus and MonthlyRate
competition = competition %>%
mutate(MonthlyIncome = NA, .after = "MaritalStatus")
# str(competition)
# train the linear regression model on the full `cs2_regres` data set
fullset_LM_model = lm(model_formula, data = cs2_regres)
# predict the labels on the competition set
preds = predict(fullset_LM_model, competition)
# add the labels to the MonthlyIncome column
competition = competition %>% mutate(MonthlyIncome = as.integer(preds))
# str(competition)
# out of curiosity, get the summary of MonthlyIncome from both datasets
summary(cs2_regres$MonthlyIncome)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1081 2840 4946 6390 8182 19999
summary(competition$MonthlyIncome)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1576 2340 6008 6242 7056 19117
# make dataframe with just competition set ordered IDs and labels
competitionLabels = competition %>% select(ID, MonthlyIncome)
# str(competitionLabels)
# save to file
# write.csv(competitionLabels, file = "data/Case2PredictionsHenderson Salary.csv", row.names = FALSE)
The goal of this analysis is to identify factors that lead to
and help predict employee attrition for the CEO and CFO of Frito Lay.
This analysis includes 36 unique features from 870 employees. Using a
combination of visualizations and statistical techniques, I identify
monthly income, job level, and overtime work as top factors driving
employee attrition. I also document trends across different job roles
and provide a tool for interactive visualization of trends in income and
attrition rates among these roles. I constructed a model to forecast
employee attrition by identifying variables and parameters that yielded
the lowest error rates, and I developed an optimized linear regression
model for forecasting monthly salaries. With these models, I offer
predictions for 300 employees whose attrition outcomes are unknown and
a separate cohort whose salaries are undetermined. Importantly, these
data reveal targeted steps that may be taken to mitigate attrition:
increase monthly income for lower earners in job level 4; reduce
overtime work across job levels 1, 2, 3, and 5; and prioritize salary
hikes and career advancement for personnel in job level 1, where
attrition rates are most pronounced. In providing actionable
intelligence alongside practical tools, my aim is to empower Frito Lay
to cultivate a work environment that promotes the retention and
development of high-value talent.
R and package versions used
devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Monterey 12.7.1
system x86_64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-04-21
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.0)
bslib 0.6.1 2023-11-28 [1] CRAN (R 4.3.0)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
caret * 6.0-94 2023-03-21 [1] CRAN (R 4.3.0)
checkmate 2.3.1 2023-12-04 [1] CRAN (R 4.3.0)
class * 7.3-22 2023-05-03 [1] CRAN (R 4.3.2)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.3.2)
codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
combinat * 0.0-8 2012-10-29 [1] CRAN (R 4.3.0)
corrplot * 0.92 2021-11-18 [1] CRAN (R 4.3.0)
data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.0)
DataExplorer * 0.8.3 2024-01-24 [1] CRAN (R 4.3.2)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.0)
e1071 * 1.7-14 2023-12-06 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
foreign 0.8-85 2023-09-09 [1] CRAN (R 4.3.2)
Formula 1.2-5 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)
future 1.33.1 2023-12-22 [1] CRAN (R 4.3.0)
future.apply 1.11.1 2023-12-21 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
GGally * 2.2.0 2023-11-22 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.0)
ggstats 0.5.1 2023-11-21 [1] CRAN (R 4.3.0)
globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.0)
gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.0)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.3.0)
gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
hardhat 1.3.1 2024-02-02 [1] CRAN (R 4.3.2)
highr 0.10 2022-12-22 [1] CRAN (R 4.3.0)
Hmisc * 5.1-2 2024-03-11 [1] CRAN (R 4.3.2)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmlTable 2.4.2 2023-10-29 [1] CRAN (R 4.3.0)
htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.0)
httpuv 1.6.13 2023-12-06 [1] CRAN (R 4.3.0)
igraph 1.6.0 2023-12-11 [1] CRAN (R 4.3.0)
ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.3.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.0)
kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.3.2)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.0)
lattice * 0.21-9 2023-10-01 [1] CRAN (R 4.3.2)
lava 1.7.3 2023-11-04 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.0)
listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.2)
Matrix 1.6-1.1 2023-09-18 [1] CRAN (R 4.3.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.2)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
ModelMetrics 1.2.2.2 2020-03-17 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
networkD3 0.4 2017-03-18 [1] CRAN (R 4.3.0)
nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.2)
nnet 7.3-19 2023-05-03 [1] CRAN (R 4.3.2)
pander * 0.6.5 2022-03-18 [1] CRAN (R 4.3.0)
parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
patchwork * 1.2.0 2024-01-08 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.3 2023-12-10 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.3 2023-09-22 [1] CRAN (R 4.3.0)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.3.0)
pROC 1.18.5 2023-11-01 [1] CRAN (R 4.3.0)
prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.0)
proxy 0.4-27 2022-06-09 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.3.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.0)
RCurl * 1.98-1.14 2024-01-09 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
recipes 1.0.9 2023-12-13 [1] CRAN (R 4.3.0)
remotes 2.4.2.1 2023-07-18 [1] CRAN (R 4.3.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.3.0)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.0)
rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.0)
rpart 4.1.21 2023-10-09 [1] CRAN (R 4.3.2)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
sass 0.4.8 2023-12-06 [1] CRAN (R 4.3.0)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.3.2)
stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.0)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.0)
survival 3.5-7 2023-08-14 [1] CRAN (R 4.3.2)
svglite 2.1.3 2023-12-08 [1] CRAN (R 4.3.0)
systemfonts 1.0.5 2023-10-09 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
timeDate 4032.109 2023-12-14 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.0)
viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.0)
withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.0)
xfun 0.41 2023-11-01 [1] CRAN (R 4.3.0)
xml2 1.3.6 2023-12-04 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
──────────────────────────────────────────────────────────────────────────────